[57] ABSTRACT

A pipelined x86 processor implements a method of detecting 
self-modifying code in which a prefetched block of instruc- 
tion bytes may contain an instruction that is modified by a 
store instruction preceding it in the execution pipeline. The 
processor includes a Prefetch unit having a multi-block 
prefetch buffer, a Branch unit with a branch target cache 
(BTC), and a Load/Store (LDST) unit having store reserva- 
tion stations. Self-modifying code is detected in three ways: 
(a) the Prefetch unit snoops store addresses from the LDST 
unit which are compared with (i) an address tag for each of 
the prefetch blocks of instruction bytes already loaded into 
the prefetch buffer, and (ii) the addresses of any pending 
prefetch requests, (b) the LDST unit snoops prefetch 
addresses issued by the Prefetch unit and compares them to 
store addresses queued in the store reservation stations, and 
(c) to ensure compatibility with the 486 specification for 
self-modifying code (which requires that a store that modi- 
fies an instruction be followed immediately by a jump to that 
instruction), the LDST unit detects when a store is followed 
by a COF that hits in the BTC, which outputs a target address
that is the same as the preceding store address. In particular, 
Prefetch unit snooping and LDST unit snooping detect 
instances of self-modifying code conditions that do not 
follow the 486 specification. 

11 Claims, 12 Drawing Sheets 
DETECTING SELF-MODIFYING CODE IN A
PIPELINED PROCESSOR WITH BRANCH 
PROCESSING BY COMPARING LATCHED 
STORE ADDRESS TO SUBSEQUENT 
TARGET ADDRESS 

CROSS REFERENCES 

This is related to the following commonly assigned U.S. patent
applications: (1) Ser. No. 08/572,948, now abandoned, titled
"Prefetch Unit With A Three-Block Prefetch Buffer And
Virtual Buffer Management Including Selectively Allocating
A Prefetch Buffer For A Branch Target Or The Not-Predicted
Path", filed Dec. 15, 1995, (2) Ser. No. 08/572,773, now
U.S. Pat. No. 5,734,881, titled "Detecting Short Branches In
A Prefetch Buffer Using Target Location Information In a
Branch Target Cache", filed Dec. 15, 1995, (3) Ser. No.
08/572,815, now abandoned, titled "Branch Target Cache
Storing The L1 Cache Index For A Target", filed Dec. 15,
1995, and (4) U.S. Pat. No. 5,701,448, titled "Detecting
Segment Limit Violations For Branch Targets When The
Branch Unit Does Not Supply The Linear Address",
issued Dec. 23, 1997.

BACKGROUND 

1. Technical Field 

The invention relates generally to computer systems, and 
more particularly relates to computer processors with 
prefetch and branch units that prefetch instructions, includ- 
ing prefetching predicted branch target addresses supplied 
by the branch unit. 

In an exemplary embodiment, the invention is used in an 
x86 processor to improve performance of prefetching and 
branch processing. 

2. Related Art 

Processors commonly use pipeline techniques to reduce 
the average execution time per instruction. An execution 
pipeline is divided into pipe stages; instructions are
executed in stages allowing multiple instructions to be in the 
execution pipeline at the same time. For example, current 
x86 processor architectures generally use the following pipe 
stages: 



IF    Instruction Fetch (or Prefetch)
ID    Instruction Decode, including instruction length decode
AC    Address Calculation or Operand Access, including register file
      access, and for memory references, address calculation for operand
      load (either from cache or external DRAM)
EX    Execute, including arithmetic, logical, and shift operations
WB    Writeback of execution results, either writeback to the register
      file or store to memory (cache or DRAM)



In particular, to keep the pipeline full, a prefetcher fetches 
instruction bytes into a prefetch buffer — instruction bytes 
are transferred to a decoder for decoding into instructions for 
execution in later stages of the pipeline. As the prefetch 
buffer is emptied by the decoder, the prefetcher fetches 
additional instruction bytes either (a) by incrementing the 
prefetcher IP (instruction pointer), or (b) by switching the 
code stream in response to a change of flow instruction (such 
as a branch). 

Change of flow (COF) instructions interrupt the code 
stream, significantly impacting pipeline performance — 
COFs typically account for 15-30% of the instruction mix. 
For example, in the x86 instruction set architecture, COFs 
occur on the average every four to six instructions. COF 
instructions include branches (including LOOP
instructions), jumps, and call/returns. Branches are conditional
in that the branch may be taken or not taken
(depending, for example, on the status of condition codes),
while jumps and call/returns are unconditional (always
taken). Taken branches and unconditional COFs (UCOFs)
interrupt the code stream to cause instruction fetch to
proceed from a target address.

Without limiting the scope of the invention, this background
information is provided in the context of a general
problem to which the invention has application: in a pipelined
processor that executes the x86 instruction set, improving
performance and efficiency of prefetching and branch
processing, and thereby the overall performance of the
execution pipeline.

The x86 instruction set architecture (ISA) allows variable 
length instructions. For the 32-bit and 64-bit x86 architec- 
tures (i.e., currently the 486, 586, and 686 generations), 
instructions can be from 1 to 15 bytes in length (the average
instruction is about 2.5 bytes). As a result, instructions will
be misaligned in memory — typically, instruction length is 
decoded during the instruction decode stage of the execution 
pipeline. 

The goal of instruction prefetch is to provide a continuous
code stream in the form of instruction bytes to the decoder
(thereby maintaining a continuous flow of instructions for
execution). Some 486 generation microprocessors used a
two-block prefetch buffer operated as a circular queue: the
current block was used to buffer instruction bytes being
delivered to the decoder, while the other block was used in
prefetching the next block of instruction bytes. Prefetch
performance is significantly impacted by COF instructions.

The 486 generation microprocessors do not have a branch
unit to provide dynamic prediction of branch direction;
rather, branches are statically predicted not-taken and
LOOPs are statically predicted taken. For branches,
prefetching continues along the not-taken (fall through)
path, and the execution pipe is flushed if the branch resolves
taken in EX. For LOOPs, the prefetcher stalls until the target
is fetched during AC/EX.

To improve pipeline performance on COFs, 586 and 686
generation microprocessors have included branch processing
units to predict the direction of branches, and in the case
of predicted taken branches (and UCOFs), to switch the
prefetcher to the target address immediately. Branch processing
significantly reduces the instances in which the
prefetcher and decoder are stalled due to a COF, which is
particularly important from a pipeline performance standpoint
as execution pipelines are lengthened (for example, by
superpipelining a stage, such as address calculation, into two
stages).

A branch unit includes a branch target cache (BTC) as
well as branch prediction and branch resolution logic. When
a branch is initially decoded and executed, then typically
(based on the prediction algorithm), if the branch is taken, its
target address is stored in the BTC as a predicted-taken
branch (not-taken branches are typically not stored in the
BTC). The next time the branch is detected (during prefetch
or decode), the BTC will supply the target address to the
prefetcher. For each branch entry, the BTC typically stores
(a) a tag identifying the branch instruction, (b) the associated
predicted target address, and (c) one or more history bits
used by the branch prediction logic. A conventional
approach is to use as the BTC tag the address of the
instruction prior to the COF to permit prefetching to switch
to a predicted taken direction as this prior instruction and the
COF instruction are decoding.

In particular, using the address of the instruction prior to
the branch as the tag enables the BTC to be accessed, and a
predicted-taken target address supplied to the prefetcher, in
the clock prior to decoding the branch instruction. In 
response to a hit in the BTC, the prefetcher switches the code 
stream in the next clock to the target direction, making the 
target instruction bytes available to the decoder immediately 
after decoding the branch instruction (assuming the prefetch 
target address hits in the cache) without stalling the execu- 
tion pipeline. 
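
The tagging convention described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the circuit of any particular processor: the class and method names are hypothetical, and the BTC is modeled as a plain dictionary rather than a set-associative array with history bits.

```python
from typing import Dict, Optional


class BranchTargetCache:
    """Toy BTC keyed by the address of the instruction *before* the COF.

    Hypothetical model for illustration only.
    """

    def __init__(self) -> None:
        # tag (address of the prior instruction) -> predicted target address
        self._targets: Dict[int, int] = {}

    def allocate(self, prior_ip: int, target: int) -> None:
        """Record a taken COF under the address of the preceding instruction."""
        self._targets[prior_ip] = target

    def lookup(self, decoding_ip: int) -> Optional[int]:
        """Return the predicted target if the decoding instruction precedes a known COF."""
        return self._targets.get(decoding_ip)


btc = BranchTargetCache()
btc.allocate(prior_ip=0x1000, target=0x2040)
hit_target = btc.lookup(0x1000)   # prior instruction is decoding: BTC hit
miss_target = btc.lookup(0x1003)  # any other instruction: BTC miss
```

Because the lookup key is the prior instruction, the target is available one clock before the COF itself is decoded, which is the point of the convention.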

The branch prediction logic implements a prediction 
algorithm based on the history bits stored with the corre- 
sponding branch entry in the BTC. The actual branch 
direction (taken or not-taken) resolves in EX in response to
condition code update — if the branch is mispredicted, 
branch resolution logic repairs the execution pipeline. 
Repair of mispredicted branches involves terminating 
execution of the instructions in the mispredicted direction, 
restoring the state of the machine, and restarting execution 
from the correct instruction (including prefetching in the 
not-predicted direction) — a branch misprediction results in a 
branch misprediction penalty corresponding to the number 
of clocks lost by mispredicting the branch. 

Branch units typically store target addresses for all 
changes of flow: branches as well as unconditional
COFs (UCOFs) such as jumps and call/returns. In the case 
of UCOFs, no prediction is required, but the stored target 
address can be used to immediately switch prefetching to the 
target address (i.e., without waiting for the UCOF to be 
decoded). 

The x86 ISA supports both segmentation and paging, and 
allows self-modifying code. In 586 and 686 generation 
processors, using a branch unit to supply target addresses to 
the prefetcher, and increasing the depth of the execution 
pipeline, necessitates taking into account segment limit 
checking and detecting self-modifying code.

Regarding segment limit checking, according to the 32-bit 
x86 memory management model (protected mode), 
addresses are generated using segmentation and, if enabled, 
paging. A code segment is defined by a segment base and 
segment limit both of which may be arbitrarily set in 
physical memory — a page is 4 Kbytes of physical memory. 
A segmented linear address (LA) is calculated by adding the 
segment base address to an offset (effective) address formed 
by adding two or three address components (relative base, 
displacement, and index) — this address is also the physical 
address (PA) if paging is not enabled. If paging is enabled, 
the physical address is obtained by translating the high order 
20 bits [31:12] of the linear address to obtain a page base 
address — the low order bits [11:0] provide a 4 Kbyte offset 
address within the page. Thus, the low order bits of the linear 
address and the translated physical address are the same. 
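
The address arithmetic in the preceding paragraph can be illustrated with a short sketch. The function names are hypothetical, and the dictionary `page_map` is an assumed stand-in for the page tables (or a TLB); it maps the upper 20 bits of a linear address to a 4 Kbyte-aligned page base.

```python
from typing import Dict, Optional

PAGE_SHIFT = 12                       # 4 Kbyte pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1   # bits [11:0]
ADDR_MASK = 0xFFFFFFFF                # 32-bit addresses


def linear_address(segment_base: int, offset: int) -> int:
    """Linear address = segment base + offset (effective) address."""
    return (segment_base + offset) & ADDR_MASK


def physical_address(linear: int, page_map: Optional[Dict[int, int]] = None) -> int:
    """Translate bits [31:12]; bits [11:0] pass through unchanged.

    With paging disabled (page_map is None) the linear address is the
    physical address.
    """
    if page_map is None:
        return linear
    page_base = page_map[linear >> PAGE_SHIFT]
    return page_base | (linear & OFFSET_MASK)


la = linear_address(0x00010000, 0x2345)
pa = physical_address(la, {0x12: 0x0007F000})
```

Here `la` is 0x12345 and `pa` is 0x7F345: only the page number changes, and the low 12 bits are identical in both addresses.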

Each linear address calculation requires a segment limit 
check to determine if a linear address crosses the segment 
boundary. Separate code and data segments are defined — if 
the prefetcher crosses a code segment boundary, a segment 
limit violation exception is signaled. 

The prefetcher typically maintains the linear and physical 
address for the current prefetch address (memory aligned), 
as well as the associated code segment limit. For sequential 
prefetching, the prefetcher increments the physical address 
to generate the prefetch address to the cache, and increments 
the corresponding linear address to detect if the prefetch 
address crosses the segment boundary (instruction bytes 
beyond the segment limit are invalidated). 

The branch unit typically supplies physical target
addresses to the prefetcher. When an entry in the BTC is
allocated for a branch instruction, the associated target
address is the physical address obtained from the AC stage
after linear address calculation and page translation. Supplying
a physical target address allows the prefetcher to
immediately begin prefetching (accessing the cache) without
the necessity of translating a linear address.

The target address supplied by the BTC is the address of 
the target instruction, which need not be memory aligned;
the prefetcher or the cache logic will convert this target
address into a memory aligned prefetch address by ignoring
the low order bits (for example, bits [3:0] for 16 byte cache
lines). Thus, the branch unit may supply a target address that
would cause the prefetcher to jump into a prefetch block 
(i.e., cache line) containing a segment limit. While the
prefetcher will have the physical prefetch address, it will not
have the corresponding linear address to compare with the
code segment limit (i.e., the target linear address is not
generated until the COF instruction reaches the AC stage).
As a result, the prefetcher may prefetch beyond the segment
limit, which is contrary to the 486 specification.
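
The alignment step and the resulting exposure can be sketched as below. The block size, function names, and example addresses are assumptions chosen for illustration; the second function performs the linear-address comparison that, per the text above, the prefetcher cannot make when it holds only a physical target address.

```python
BLOCK_BYTES = 16  # assumed prefetch block (cache line) size


def align_to_block(addr: int, block_bytes: int = BLOCK_BYTES) -> int:
    """Drop the low order bits to obtain a memory aligned prefetch address."""
    return addr & ~(block_bytes - 1)


def bytes_beyond_limit(block_linear: int, limit_linear: int,
                       block_bytes: int = BLOCK_BYTES) -> int:
    """Count bytes of the block that lie past the last valid byte of the segment.

    limit_linear is the (hypothetical) linear address of the last valid byte.
    """
    last_byte = block_linear + block_bytes - 1
    return max(0, last_byte - limit_linear)


aligned = align_to_block(0x2047)                  # target inside a block
overrun = bytes_beyond_limit(aligned, 0x204A)     # limit falls inside the same block
```

In this example the aligned block starts at 0x2040 and five of its bytes lie beyond the segment limit, so those bytes would have to be invalidated.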

Regarding self-modifying code, the standard 486 specification
requires that a write instruction that modifies a
"target" instruction be followed immediately by a jump to
the modified target instruction. As a result, the target
instruction is first modified by the write, and then fetched by
the jump for execution. Not all 486 code follows this
specification.

For 586 and 686 generation architectures, maintaining
compatibility with existing software that includes
self-modifying code is made problematic by architectural
changes that increase the likelihood that a write to an
instruction will not complete before the instruction is
fetched. Such architectural features include dynamic branch
prediction, increased prefetch buffer size, and store reservation
stations (pre-cache write buffers).

SUMMARY 

An object of the invention is to improve the performance 
of prefetching and branch processing, and therefore the 
overall performance of an execution pipeline, and more
particularly, to ensure detection of self-modifying code even 
if it does not follow the 486 self-modifying code specifica- 
tion. 

These and other objects of the invention are achieved by
a scheme for detecting self-modifying code in a pipelined
processor with branch processing.

In one aspect of the invention, at least some self-modifying
code is characterized by a store instruction that
modifies a target instruction followed by a jump instruction
to jump to such modified target instruction (i.e., the 486
self-modifying code specification). A prefetch unit issues
prefetch addresses for prefetch blocks of instruction bytes,
and loads prefetch blocks into a prefetch buffer for transfer
to a decoder. A branch target cache (BTC), for each of
selected COF (change-of-flow) instructions, provides target
address information used to generate a prefetch address for
a prefetch block including a corresponding target address.

Store control logic is responsive to a store instruction
being decoded to latch the associated store address at least
until the next instruction has completed decoding. The store
control logic (a) detects whether the next instruction is a
jump instruction that hits in the BTC such that the BTC
supplies target information for the jump, and (b) compares
the store address and the target address; if they match, the
store control logic signals a code modification condition.

In response to the code modification condition, the
prefetch unit (i) flushes any instruction bytes in a corresponding
target prefetch block containing such target
address, and (ii) re-issues a prefetch address for such target
prefetch block after such store operation completes.
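
The store-then-jump check described above can be sketched as a small state machine. This is an illustrative model only: the class name, the string instruction kinds, and the dictionary standing in for the BTC (keyed here by the jump's own address for simplicity) are all assumptions, and the latch is held for exactly one following instruction, the minimum the text requires.

```python
from typing import Dict, Optional


class StoreJumpDetector:
    """Latch a store address and compare it with the next jump's BTC target."""

    def __init__(self, btc_targets: Dict[int, int]) -> None:
        self.btc_targets = btc_targets            # hypothetical: jump ip -> target
        self.latched_store: Optional[int] = None  # store address held for one instruction

    def decode(self, kind: str, ip: int, store_addr: Optional[int] = None) -> bool:
        """Decode one instruction; return True if code modification is signaled."""
        signal = False
        if kind == "jump" and self.latched_store is not None:
            target = self.btc_targets.get(ip)     # None models a BTC miss
            signal = target is not None and target == self.latched_store
        # The latch survives only until the next instruction has decoded.
        self.latched_store = store_addr if kind == "store" else None
        return signal


det = StoreJumpDetector({0x104: 0x300})
after_store = det.decode("store", 0x100, store_addr=0x300)
after_jump = det.decode("jump", 0x104)
```

On a signal, the prefetch unit would flush the target prefetch block and re-issue the prefetch once the store completes; that response is outside this sketch.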

In another aspect of the invention, the prefetch unit
includes prefetch snoop logic that detects store addresses;
for each store address detected, the prefetch unit determines
whether a prefetch block match exists between the store
address and either (i) an address included within a prefetch
block for which a pending prefetch address has been issued
but a corresponding prefetch block not yet stored in the
prefetch buffer, or (ii) an address included within a prefetch
block already stored in the prefetch buffer.

For each store address for which a prefetch block match 
is detected, the prefetch unit (i) inhibits instruction bytes in 
the corresponding prefetch block from being transferred to 
the decoder, and (ii) re-issues a prefetch address for such 
prefetch block after the associated store operation is com- 
plete. 
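
A minimal sketch of this prefetch-side snooping follows. The class, its method names, the use of sets for the buffer tags and pending requests, and the 16-byte block size are assumptions for illustration; real hardware would compare address tags in parallel rather than search a collection.

```python
BLOCK_BYTES = 16  # assumed prefetch block size


def block_of(addr: int) -> int:
    """Block-aligned address containing addr."""
    return addr & ~(BLOCK_BYTES - 1)


class PrefetchSnoop:
    """Hypothetical model of store-address snooping by the prefetch unit."""

    def __init__(self) -> None:
        self.buffer_tags = set()   # blocks already loaded into the prefetch buffer
        self.pending = set()       # blocks requested but not yet loaded
        self.inhibited = set()     # blocks barred from transfer to the decoder

    def snoop_store(self, store_addr: int) -> bool:
        """Return True (and inhibit the block) on a prefetch block match."""
        blk = block_of(store_addr)
        if blk in self.buffer_tags or blk in self.pending:
            self.inhibited.add(blk)
            return True
        return False

    def store_complete(self, store_addr: int):
        """After the store completes, re-issue the prefetch for an inhibited block."""
        blk = block_of(store_addr)
        if blk not in self.inhibited:
            return None
        self.inhibited.discard(blk)
        self.buffer_tags.discard(blk)  # stale bytes are dropped
        self.pending.add(blk)          # re-issued prefetch request
        return blk


pfu = PrefetchSnoop()
pfu.buffer_tags.add(0x2040)
matched = pfu.snoop_store(0x2047)
reissued = pfu.store_complete(0x2047)
```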

In another aspect of the invention, the store control logic 
includes store reservation stations that queue pending store 
addresses for multiple store operations. Store snoop logic in 
the store control logic detects prefetch addresses issued by 
the prefetch unit — the store control logic compares such 
prefetch address with pending store addresses, and in the 
case of a match, signals a code modification condition. 

In response to the code modification condition, the 
prefetch unit (i) inhibits instruction bytes in the correspond- 
ing prefetch block from being transferred to the decoder, and 
(ii) re-issues a prefetch address for such prefetch block after 
the pending store operations are complete. 
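
The complementary check, in which the store side snoops prefetch addresses, can be sketched as follows. The class and method names are hypothetical, and comparing at prefetch block granularity (16 bytes assumed) is a modeling choice rather than something the text specifies.

```python
from collections import deque


class StoreReservationStations:
    """Hypothetical queue of pending store addresses that snoops prefetches."""

    def __init__(self, block_bytes: int = 16) -> None:
        self.block_bytes = block_bytes
        self.pending = deque()  # store addresses not yet written to cache/memory

    def queue_store(self, addr: int) -> None:
        self.pending.append(addr)

    def complete_oldest(self) -> int:
        """Retire the oldest queued store and return its address."""
        return self.pending.popleft()

    def snoop_prefetch(self, prefetch_addr: int) -> bool:
        """Signal code modification if any queued store falls in the prefetched block."""
        mask = ~(self.block_bytes - 1)
        block = prefetch_addr & mask
        return any((store & mask) == block for store in self.pending)


ldst = StoreReservationStations()
ldst.queue_store(0x3008)
conflict = ldst.snoop_prefetch(0x3000)   # prefetch of the block being modified
ldst.complete_oldest()
clear = ldst.snoop_prefetch(0x3000)      # store has drained; prefetch is safe
```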

Embodiments of the invention may be implemented to 
realize one or more of the following technical advantages. 
The invention may be used in a pipelined Processor with a 
Prefetch unit, Branch unit, and LDST (load/store) unit. 

The Prefetch unit and LDST unit cooperate to detect 
self-modifying code in three ways: (a) the LDST unit detects 
when a store instruction is followed by a jump that hits in the 
BTC and for which the store address and the jump target 
address are the same, (b) the Prefetch unit snoops store 
addresses issued by the LDST unit for comparison with the 
addresses of prefetch blocks in the prefetch buffer and with 
pending prefetch requests, and (c) the LDST unit snoops 
prefetch requests issued by the Prefetch unit for comparison 
with queued stores. The first detection scheme ensures
compatibility with the 486 self-modifying code
specification; the other two detection schemes ensure
detection for code that does not follow that specification.
This self-modifying code detection method enables early
detection of self-modifying code conditions in a pipelined 
processor with branch processing and store buffering, 
thereby enhancing prefetching and execution pipeline per- 
formance and reducing the complexity of executing self- 
modifying code, with an attendant increase in computer 
system performance. 

The Prefetch unit includes a three-block prefetch buffer 
and uses virtual buffer management to logically allocate the 
physical buffer blocks as holding current, next, and previous 
prefetch blocks. For sequential decoding, virtual buffer 
management is used to operate the prefetch buffer as a 
circular queue. For branches, the logical previous buffer 
block can be assigned to store either a target prefetch block 
or a prefetch block including the not-predicted path of the 
branch, while the logical current and next buffer blocks are 
able to continue operation as a circular queue. The virtual 
buffer management scheme increases performance of the
Prefetch unit in delivering instruction bytes to a decoder, and
thereby increases instruction throughput and overall computer
system performance.
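
One way to picture virtual buffer management is as a mapping from logical roles to physical block indices. The sketch below is an assumption-laden illustration, not the exemplary embodiment's control logic: the rotation order, the `hold_prev` flag, and the class name are all hypothetical.

```python
class VirtualPrefetchBuffer:
    """Logical CURR/NEXT/PREV roles mapped onto three physical buffer blocks."""

    def __init__(self) -> None:
        # logical role -> physical block index (assumed initial allocation)
        self.role = {"CURR": 0, "NEXT": 1, "PREV": 2}

    def advance(self, hold_prev: bool = False) -> None:
        """Reassign roles when the current block has been consumed.

        hold_prev=False: sequential case, all three blocks rotate as a
        circular queue.
        hold_prev=True: the PREV block is reserved (for example, for a branch
        target or not-predicted path), so only CURR and NEXT alternate.
        """
        r = self.role
        if hold_prev:
            r["CURR"], r["NEXT"] = r["NEXT"], r["CURR"]
        else:
            r["CURR"], r["NEXT"], r["PREV"] = r["NEXT"], r["PREV"], r["CURR"]


buf = VirtualPrefetchBuffer()
buf.advance()                      # sequential: three-block circular queue
sequential_roles = dict(buf.role)
buf.advance(hold_prev=True)        # branch pending: PREV block stays put
branch_roles = dict(buf.role)
```

No instruction bytes move between physical blocks; only the role assignment changes, which is what makes the management "virtual".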

The Branch unit includes a branch target cache (BTC)
that, for each COF entry, stores target information sufficient
to identify a cache location, for example, the L1 Cache
index and way number, instead of the full target address
(including the tag address used for tag comparison with the
address of the prefetch block stored at that location). For a
BTC hit, the exemplary L1 Cache is accessed with the L1
Cache index for set selection, with the way number being
used to select a particular cache location in the set; the L1
Cache returns the prefetch block (cache line) at that cache
location along with the associated tag address. Caching the
L1 Cache index and way number (or other cache location
identification information) represents a significant reduction
in the number of bits stored in the BTC, thereby reducing die
area required for the BTC, and allowing a reduction in
overall die area or an increased allocation of die area to other
processor modules, with an attendant decrease in computer
system cost or increase in computer system performance.

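
The size of the saving can be estimated with a short calculation. The cache geometry used below (16 Kbytes, 4-way, 16-byte lines) is a hypothetical example chosen only to make the arithmetic concrete, and the function names are illustrative.

```python
def log2_exact(n: int) -> int:
    """Base-2 logarithm of a power of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def btc_location_bits(cache_bytes: int, ways: int, line_bytes: int) -> int:
    """Bits needed to name one cache location: set index plus way number."""
    sets = cache_bytes // (ways * line_bytes)
    return log2_exact(sets) + log2_exact(ways)


# Hypothetical geometry: 16 Kbyte, 4-way set associative, 16-byte lines.
location_bits = btc_location_bits(16 * 1024, 4, 16)
offset_bits = log2_exact(16)                    # byte location within the line
stored_bits = location_bits + offset_bits
saved_bits = 32 - stored_bits                   # versus a full 32-bit target address
```

Under these assumed parameters each BTC entry stores 14 bits of target information instead of 32, a saving of 18 bits per entry.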
For a more complete understanding of the invention, and 
for further features and advantages, reference is now made 
to the Detailed Description of an exemplary embodiment of 
the invention, together with the accompanying Drawings, it 
being understood that the invention encompasses any modi- 
fications or alternative embodiments that fall within the 
scope of the claims. 
DRAWINGS

FIG. 1 illustrates an exemplary computer system includ- 
ing a processor and memory subsystem intercoupled over a 
processor bus, together with peripheral bus interface. 

FIGS. 2a and 2b illustrate an exemplary processor including
a prefetch unit and branch unit.

FIG. 3 illustrates an exemplary prefetch unit, including 
the principal data, address, and control blocks, and the bus 
interconnections to the execution pipe, branch unit, load 
store unit, L1 cache, and bus controller.

FIG. 4a illustrates an exemplary prefetch buffer 
organization, including a three-block prefetch buffer and
associated multiplexing/alignment logic and control logic. 

FIG. 4b illustrates the fields for an exemplary prefetch 
buffer block. 

FIGS. 5a-5c illustrate the placement of COF instructions
within prefetch buffers logically allocated as PREV, CURR,
and NEXT. FIG. 5a illustrates the case where indexing
instruction N-1 is located at the end of the current IB buffer
IB2 and an associated COF is located at the beginning of the
next prefetch block residing in IB buffer IB3. FIG. 5b
illustrates the case where the indexing instruction N-1 is
misaligned in that it straddles two prefetch blocks IB2 and 
IB3. FIG. 5c illustrates the case where the COF is mis- 
aligned. 

FIG. 5d illustrates short COF conditions where a COF 
instruction and its associated target instruction reside in the 
prefetch buffers at the same time such that a prefetch block 
of instruction bytes containing the target instruction need not 
be prefetched. 

FIG. 6a illustrates an exemplary branch unit including 
branch target cache, return stack, and branch resolution 
buffer. 

FIG. 6b illustrates the organization of an exemplary 
branch target cache as 4-way set associative. 

FIG. 6c illustrates exemplary COF entries in the branch 
target cache including for each entry the L1 cache index (set
number), byte location number, and way number for the
cache line containing the target instruction in the L1 cache,
together with an IB_LOC field that when valid indicates the 
location within the prefetch buffer of a target for a short 
COF, and two history bits used in predicting the direction of 
conditional COFs. 

FIG. 7 illustrates the detection of short COF conditions, 
including setting the IB_LOC field in the branch target 
cache. 

FIGS. 8a and 8b illustrate timing diagrams respectively
for reading and writing the branch target cache. 

FIG. 9 illustrates an exemplary return stack organization. 

FIG. 10 illustrates the two bit prediction algorithm for 
conditional COFs. 

FIG. 11 illustrates the organization of an exemplary 
branch resolution buffer. 

FIGS. 12a-12c illustrate the exemplary virtual buffer
management scheme using CURR/NEXT/PREV logical
buffer allocations along with temporary TARG and
N_PRED tag assignments respectively for (FIG. 12a) COFs
that miss in the branch target cache and are statically
predicted taken, (FIG. 12b) COFs that miss in the branch
target cache and are statically predicted not-taken, and (FIG.
12c) COFs that hit in the branch target cache and are
predicted taken.

FIG. 13 illustrates the exemplary scheme for detecting
segment limit violations in the case of COFs that hit in the
branch unit.

FIGS. 14a-14c illustrate the exemplary scheme for
detecting self-modifying code using respectively (FIG. 14a)
an implementation of the 486 jump/store specification for
JMPs that hit in the BTC, (FIG. 14b) Prefetch unit snooping
of store addresses for comparison with pending prefetch
requests and prefetch blocks already in the prefetch buffer
blocks, and (FIG. 14c) LDST unit snooping of prefetch
requests for comparison with stores queued in the LDST
store reservation stations.

DETAILED DESCRIPTION

The detailed description of an exemplary embodiment of
the scheme for detecting self-modifying code in a pipelined
processor with branch processing and store buffering is
organized as follows:

1. Computer System
1.1. System
1.2. Processor

2. Prefetch Unit
2.1. Prefetch Buffer
2.2. Buffer Control
2.3. Prefetch Addressing
2.3.1. Sequential Prefetching
2.3.2. COFs

3. Branch Unit
3.1. Target Information
3.1.1. Storing Cache Indices
3.1.2. Short COFs
3.2. BTC Access
3.2.1. BTC Miss
3.2.2. BTC Write
3.3. Return Stack
3.4. Branch Prediction
3.5. Resolution

4. Virtual Buffer Management
4.1. Sequential Prefetch
4.2. Normal COF
4.2.1. BTC Miss
4.2.2. BTC Hit
4.2.3. Return Stack
4.3. Short COF
4.4. Not-Predicted Path
4.5. Mispredicted Branch

5. Limit Checking
5.1. Segment Limit Checking
5.2. Page Limit Checking

6. Detecting Self-Modifying Code

7. Conclusion

This organizational outline, and the corresponding headings,
are used in this Detailed Description for convenience of
reference only.

The exemplary prefetch/branch unit organization is used
to support prefetching and branch processing (including
branch prediction) in a 586 generation processor. Detailed
descriptions of conventional or known aspects of processor
systems are omitted so as to not obscure the description of
the invention. In particular, terminology specific to the x86
processor architecture (such as register names, signal
nomenclature, addressing modes, pinout definition, etc.) is
known to practitioners in the processor field, as is the basic
design and operation of such processors and of computer
systems based on them.

When used with a signal, the # symbol designates a signal
that is active low, while the / symbol designates the complement
of a signal.

1. Computer System

FIG. 1 illustrates an exemplary computer system, including
a system or motherboard 100 with a Processor 200,
memory subsystem 400, and system logic including system
chipset 601 and datapath chipset 602.

FIGS. 2a and 2b illustrate the exemplary x86 Processor.

1.1. System 

Referring to FIG. 1, motherboard 100 includes the Processor
200 interfaced to the memory subsystem 400 over a
P-BUS (sometimes referred to as a CPU or local bus). The
system logic includes, in addition to the system chipset 601
and datapath chipset 602, an external clock source 604
(which provides an external clock input to the Processor and
system clock signals to the rest of the motherboard).

For the exemplary computer system, the P-BUS is a
conventional 486-type 32-bit address and data bus.

For the exemplary computer system, the only system
elements that reside on the P-BUS are the Processor 200,
memory subsystem 400, and the system and datapath
chipsets 601 and 602. According to the exemplary division
of system logic functions, the system chipset interfaces to a
conventional 32-bit PCI peripheral bus, while the datapath
chipset interfaces to a 16-bit ISA peripheral bus and an
internal 8-bit X bus.

Some current systems allow for a special VL-bus direct
interface to the P-BUS for video/graphics and other peripherals.

For 32-bit systems with a 32-bit P-BUS, some current
system logic designs combine the system and datapath
chipset functions into a single chipset. For 64-bit systems
with a 64-bit P-BUS, the pin count required by the 64-bit
data bus width currently necessitates that the system and
datapath chipset functions be split as indicated in FIG. 1.

Processor 200 is coupled over the P-BUS to system
DRAM (memory) 402 and L2 (level 2) cache 404; data
buffers 406 control P-BUS loading by the system DRAM.
The system chipset 601 includes P-BUS, DRAM, and L2
cache control.

The datapath chipset 602 interfaces to the conventional X
bus. The X bus is an internal 8-bit bus that couples to the
BIOS ROM 702 and the RTC (real time clock) 704. In
addition, a conventional 8-bit keyboard controller 706
resides on the X bus.

The system and datapath chipsets 601 and 602 provide
interface control for the 16-bit ISA bus and the 32-bit PCI
bus. The ISA bus maintains compatibility with industry
standard peripherals, coupling to ISA peripheral card slots
710. The PCI bus provides a higher performance peripheral
interface for selected peripherals, including coupling to PCI
peripheral card slots 810. In particular, a video/graphics
card (including VRAM) 802 provides a video/graphics
interface, while a storage controller 804 (which may be
included as part of the system chipset) interfaces to storage
peripherals.

The motherboard 100 couples through the PCI, ISA and 
X buses to external peripherals 900, such as keyboard 902, 
display 904, and mass storage 906. Network and modem 
interconnections are provided as ISA cards (but could be
PCI cards). 

1.2. Processor 

Referring to FIG. 2a, exemplary Processor 200 is an x86
processor that uses a modular architecture in which pipelined
CPU core 202, L1 (level 1) Cache 204, FPU (floating
point unit) 206, and Bus Controller 208 are interconnected
over an arbitrated C-BUS. The CPU core interfaces to the
C-BUS through Prefetch and Load/Store modules. The Bus
Controller provides the interface to the external P-BUS.

Referring to FIG. 2b, the Processor uses a six stage
instruction execution pipeline: Instruction Fetch IF, Instruction
Decode ID, Address Calculation AC1/AC2, Execution
EX, and Writeback WB. The superpipelined AC stage performs
instruction operand access: register file access, and
for memory reference instructions, cache access.

Referring to FIG. 2a, CPU core 202 includes an execution 
core 210 that encompasses the ID, AC, EX, and WB 
execution stages. A Prefetch Unit 240 performs Instruction 
Fetch in conjunction with a Branch Unit 250, prefetching 
instruction bytes for Instruction Decode. A Load/Store unit
260 performs operand loads and stores results for the AC, 
EX, and WB stages. A clock generator 270 receives the 
external system clock, and generates internal core and other 
clocks, including performing clock multiplication and 
implementing clock stopping mechanisms.

Execution core 210 includes a Decode unit (ID) 211, an 
AC unit 212, and an EX unit 215. A Pipe Control unit 217 
controls the flow of instructions through pipe stages of the 
execution core, including stalls and pipe flushes. 

The EX unit is microcode controlled by a microcontrol
unit 222 (microsequencer and microrom) and a general
register file 224. The EX unit performs add, logical, and shift
functions, and includes a hardware multiplier/divider. Operands
are transferred from the register file or Cache (memory)
over two source buses S0 and S1, and execution results are
written back to the register file or the Cache (memory) over 
a writeback bus WB. 

Prefetch unit (PFU) 240 performs Instruction Fetch,
fetching instruction bytes directly from the Cache 204, or
from external memory through the Bus Controller 208;
instruction bytes are transferred in 8 byte blocks to ID 211
for decoding. The PFU fetches prefetch blocks of 16 instruction
bytes (cache line) into a three-block prefetch buffer 242.
A virtual buffer management scheme is used to allocate
physical prefetch buffers organized as a circular queue.

Branch unit (BU) 250 supplies prefetch addresses for
COF instructions: predicted-taken branches and unconditional
changes of flow (UCOFs) (jumps and call/returns).
The BU includes a branch target cache (BTC) 252 for
branches and jumps/calls and a return stack RSTK (not
shown) for returns; the BTC is accessed with the instruction
pointer for the instruction prior to the COF, while the
RSTK is controlled by signals from ID 211 when a
call/return is decoded. For branches that miss in the BTC (which
will then be statically predicted), the PFU will speculatively
prefetch along the not-predicted path to enable prefetching
to switch immediately in case the branch is mispredicted.

The Decode unit (ID) 211 performs Instruction Decode,
decoding one x86 instruction per clock. ID receives 8 bytes
of instruction data from prefetch buffer 242 each clock,
returning a bytes-used signal to allow the prefetch buffer to
increment for the next transfer.

Decoded instructions are dispatched to AC 212, which is
superpipelined into AC1 and AC2 pipe stages, performing
operand access for the EX stage of the execution pipeline.
For memory references (reads or writes), the AC1 stage
calculates one linear address per clock (address calculations
involving four components require an additional clock),
with limit checking being performed in AC2; if paging is
enabled, the AC2 stage performs linear-to-physical address
translation through a TLB (translation lookaside buffer) 230.
Instruction operands are accessed during AC2: for
non-memory references, the register file is accessed, and for
memory references, the Cache 204 is accessed.

The Cache is virtually indexed and physically tagged such
that set selection is performed with the linear (untranslated)
address available in AC1, and tag comparison is performed
with the physical (translated) address available early in AC2,
allowing operand accesses that hit in the cache to be
supplied by the end of AC2 (the same as a register access).
For accesses that miss in the Cache, cache control logic
initiates an external bus cycle through the Bus Controller
208 to load the operand.

After operand access, the AC unit issues integer instructions
to the EX unit 220, and floating point instructions to the
FPU 206. EX and the FPU perform the EX and WB stages
of the execution pipeline.

EX 220 receives source operands over the two source
buses S0/S1 (a) as immediate data passed along with the
instruction from AC 212, (b) from the register file 224,
and/or for memory references, (c) from the Cache 204 or
external memory through the Load/Store unit. In particular,
for memory references that require an external bus cycle, EX
will stall until operand load is complete.

Execution results are written back in the WB stage either
to the register file, or to the Cache (memory); stores to the
Cache (memory) are posted in store reservation stations in
the Load/Store unit 260.

Load/Store (LDST) unit 260 performs operand loads and
result stores for the AC/EX units; in addition, for branches
that miss in the BTC, the LDST unit issues prefetch requests
for the target. Loads have the highest priority, except in the
case of branches that miss in the BTC where the prefetch
request for the target is given priority. Four reservation
station buffers 262 are used for posting stores; stores can be
posted conditionally pending resolution of a branch, retiring
only if the branch resolves correctly. Stores are queued in
program order; operand loads initiated during AC2 may
bypass pending stores. 
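
The store posting behavior described above can be sketched as a small software model. This is an illustrative sketch only, not the disclosed circuit: the class name StoreQueue, its method names, and the branch_id bookkeeping are assumptions introduced for the example. Only the four-entry depth, the program-order queuing, and the rule that a conditionally posted store retires only if its branch resolves correctly are taken from the text.

```python
from collections import deque


class StoreQueue:
    """Hypothetical model of the four store reservation stations."""

    DEPTH = 4  # four reservation station buffers, per the text

    def __init__(self):
        self.entries = deque()  # oldest store on the left (program order)

    def post(self, addr, data, branch_id=None):
        """Post a store; branch_id marks it conditional on that branch."""
        if len(self.entries) >= self.DEPTH:
            return False  # no free station (the pipe would have to wait)
        self.entries.append({"addr": addr, "data": data, "branch": branch_id})
        return True

    def resolve(self, branch_id, correct):
        """Resolve a branch: keep its stores if predicted correctly, else drop them."""
        kept = deque()
        for e in self.entries:
            if e["branch"] == branch_id:
                if not correct:
                    continue  # store was posted on a mispredicted path
                e = dict(e, branch=None)  # store is now unconditional
            kept.append(e)
        self.entries = kept

    def retire(self, memory):
        """Retire the oldest store if it is no longer conditional."""
        if self.entries and self.entries[0]["branch"] is None:
            e = self.entries.popleft()
            memory[e["addr"]] = e["data"]
            return True
        return False
```

In this model a conditional store at the head of the queue blocks retirement of younger stores, which keeps retirement in program order.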

The L1 (level one) Cache 204 is a 16K byte unified
data/instruction cache, organized as 4 way set associative
with 256 sets and 4 ways per set, with each way in each set
constituting a location for a 16 byte (4 dword) cache line
(i.e., 256x4 cache lines). The Cache can be operated in either 
write-through or write-back mode; to support a write-back
coherency protocol, each cache line includes 4 dirty bits
(one per dword).

Bus Controller (BC) 208 interfaces to the 32-bit address
and data P-BUS, and to two internal buses: the C-BUS and
an X-BUS. Alternatively, the BC can be modified to interface
to an external 64-bit data P-BUS (such as the Pentium-type
bus). The BC includes 8 write buffers for staging
external write cycles.

The C-BUS is an arbitrated bus that interconnects the
execution core 210, Prefetch unit 240, LDST unit 260,
Cache 204, FPU 206, and the BC 208; C-BUS control is in
the BC. The C-BUS includes a 32 bit address bus
C_ADDR, two 32-bit data buses C_DATA and
C_DDATA, and a 128-bit (16 byte cache line) dedicated
instruction bus. C_DATA and C_DDATA can be controlled
to provide for 64 bit transfers to the FPU, and to support
interfacing the Cache to a 64-bit external data bus. In
general, for 32-bit data transfers, the C_DATA bus is used
for loads coming from off-chip through the BC to the LDST
unit, the Cache, and/or the Prefetch Unit, and the
C_DDATA bus is used for stores into the Cache or external
memory through the BC. For instruction fetch misses,
instruction data is provided over the C_DATA bus to the
Prefetch unit at the same time it is provided to the Cache.

The X-bus is an extension of the external bus interface
that allows peripheral devices to be integrated on chip.

2. Prefetch Unit

FIG. 3 illustrates an exemplary Prefetch unit (PFU) 240
that implements the instruction fetch (IF) stage of the
execution pipeline for the exemplary Processor described in
Section 1.1 in connection with FIG. 2a. The PFU includes
four principal blocks: two data path blocks pf_dpath 302
and pf_apath 304, and two control blocks pf_adctl 306 and
pf_cbusctl 308. Pf_dpath 302 includes the multi-block
prefetch buffer 242.

Dedicated buses interconnect the PFU to Branch unit
(BU) 250, Decoder (ID) 211 and AC 212 (in execution core
210), Load/Store (LDST) 260, Cache 204, and Bus Controller
(BC) 208. In particular, the PFU interfaces to the
C-BUS for issuing prefetch addresses (C_ADDR) and
receiving prefetch instruction bytes (either C_IDATA or
C_DATA).

The two data path blocks pf_dpath 302 and pf_apath 304
are described in Sections 2.1 and 2.2. In general, pf_dpath
302 includes the 48-byte multi-block prefetch buffer 242,
organized as three 16-byte (cache line) prefetch buffer
blocks; a virtual buffer management scheme is used to
allocate the three prefetch buffer blocks (see, Section 4).
Pf_dpath supplies up to 8 instruction bytes to the decoder
211 each clock cycle; these bytes are shifted and aligned
based on the instruction boundary of the last instruction
decoded.

Pf_apath 304 generates prefetch requests over the
C_ADDR bus, either by incrementing the last prefetch
address (sequential prefetching), or by submitting a COF
target address: (a) if the COF hits in BU 250, the BU will
supply the target address to the PFU which will generate a
prefetch request, or (b) if the COF misses, AC 212 will
supply the target address to LDST 260 which will initiate a
prefetch request. In response to a prefetch address, either the
Cache 204 returns the corresponding prefetch block of 16
instruction bytes over the 128-bit dedicated instruction bus
C_IDATA, or, if the cache request misses, BC 208 runs an
external cache line fill cycle (4-dword burst cycle) and
returns the instruction bytes over the 32-bit C_DATA bus (4
dword transfers).

The control block pf_adctl 306 basically handles the
control functions associated with the pf_dpath and
pf_apath datapath blocks. It receives control information
from the C-BUS as well as ID 211, AC 212, and BU
250; this control information is converted to direct control
of the multi-block prefetch buffer, as well as address
manipulation functions in the two datapaths pf_dpath and
pf_apath.

The control block pf_cbusctl 308 includes the majority of
state information associated with the Prefetch unit, as well
as the C-BUS interface control. Pf_cbusctl generates
prefetch requests to the C-BUS, and controls the sequencing
of these requests as they are satisfied. In particular, this
control block tracks COFs down the execution pipe, using
information from BU 250, ID 211, and AC 212.

PFU 240 and BU 250 cooperate to reduce the impact of
COFs on the prefetch operation, including (a) for branches
that miss in the BTC, buffering the not predicted path, and
(b) for short COFs (including short LOOPs) for which the
target is already in the prefetch buffer, switching to the target
address without generating a prefetch request. For branches
that miss in the BTC, the PFU allocates a prefetch buffer to
both the predicted and not-predicted paths, allowing the
PFU to immediately switch the code stream if the branch
resolves mispredicted; branches that miss in the BTC are,
(a) in the case of conditional jumps, predicted not-taken, and
(b) in the case of LOOPs, predicted taken. For short COFs
(forward or backward branches or UCOFs), the BTC stores
information indicating that the target address is already in
one of the prefetch buffers, and available for transfer to ID
211, obviating a prefetch operation to retrieve the target.

For the exemplary embodiment, COFs are indexed in BU
250 with the instruction pointer for the instruction prior to
the COF to allow the prefetch request resulting from a BTC
hit to be generated early enough to prefetch the target cache
line in time for the target instruction bytes to be ready for
transfer to the Decoder without any branch delay penalty.

In addition, PFU 240 includes mechanisms (a) to ensure
that a segment boundary is not crossed when BU 250
supplies the prefetch target address (see, Section 5), and (b)
to detect self-modifying code that does not conform to the
486 self-modifying code specification (see, Section 6).

2.1. Prefetch Buffer

FIG. 4a illustrates the exemplary prefetch buffer organization
of the pf_dpath block 302, which includes three-block
prefetch buffer 242, mux/align logic 312, and byte
pointer control logic 314 in the pf_dpath block 302.
Mux/align logic 312 includes a 5:1 multiplexer 316, and an
aligner 317; the aligner is controlled by pointer control
314, which is responsive to a bytes-used shift signal id_shift
from the decoder (ID 211 in FIG. 3) to shift the pointer to
a new initial instruction byte for the next 8 byte transfer.
Thus, unless the prefetcher is stalled, in each clock, up to 8
instruction bytes are staged in latch 318 for transfer to the
decoder.

The prefetch buffer 242 includes three buffers IB0, IB1,
and IB2, each staging a 16-byte (cache line) block of
instruction bytes. That is, in each IB buffer, the instruction
bytes are cache aligned on 16-byte boundaries, with each
prefetch request generated by the PFU or the LDST unit
returning a 16-byte block of instruction bytes that is routed
to a selected one of the IB prefetch buffers.

FIG. 4b illustrates the fields for each of the IB prefetch
buffers; portions of a buffer are physically located in either
the pf_dpath or pf_apath blocks of the PFU. In the
pf_dpath block (302 in FIGS. 3 and 4a), aside from the data
block of 16 instruction bytes (128 bits), each instruction byte
has associated with it valid (V) and segment-valid (SV) bits
(16 bits per field to allow one-hot interrogation). The valid
bits indicate the validity of the data. The segment-valid bits
are used to delineate the exact location of a segment boundary
within the 16-byte block; if cleared for a given byte,
that byte and the following bytes of the block are past the
boundary and therefore invalid. In addition, a segment-limit
(SL) bit and a page-limit (PL) bit indicate that a segment
boundary or page boundary is present within that buffer
block (page-valid bits are not needed because page boundaries
are always aligned with block boundaries).

Two other prefetch buffer fields reside in the pf_apath
block (304 in FIGS. 3 and 4a): a prefetch address tag
and an associated valid bit AV. Each IB buffer includes a 28-bit
prefetch address tag representing the memory-aligned physical
prefetch address that fetched the prefetch block into the
buffer; these bits are bits [31:4] of the 32-bit prefetch
address (bits [3:0] are not used in identifying the 16-byte
cache line). For each prefetch address request issued over
the C_ADDR bus, the pf_apath block maintains a copy in
a register (not shown); the prefetch address tag field is
loaded from this register into the appropriate IB buffer when
the prefetch block returns (AV is then set to indicate that the
address tag is valid).

FIGS. 5a-5d illustrate, for the multi-block prefetch buffer
242, the juxtaposition of COFs and related variable length
x86 instructions stored in the IB buffers IB1-IB3. As a
preceding instruction N-1 is decoding, the BTC will be
accessed and, if the access hits, the corresponding target
address for the COF will be fetched, or in the case of short
COFs, identified as already in one of the IB buffers, thereby
enabling the code stream to switch to the target direction in
the clock after the COF instruction decodes (assuming that,
if a fetch is required, it hits in the Cache). Thus, the target
instruction will follow the COF instruction into the execution
pipeline without introducing a bubble (i.e., without
stalling the decoder).

FIGS. 5a-5c illustrate situations in which two prefetch
blocks, and therefore two IB buffers, are required to obtain
both the indexing instruction N-1 and the COF. In FIG. 5a,
the indexing instruction N-1 is located at the end of the
current IB buffer IB2 and an associated COF is located at the
beginning of the next prefetch block residing in IB buffer
IB3. In FIG. 5b, the indexing instruction N-1 is misaligned
in that it straddles two prefetch blocks IB2 and IB3. In FIG.
5c, the COF is misaligned.

FIG. 5d illustrates short COF conditions. Both the indexing
instruction N-1 and the COF reside in the current
prefetch block in IB buffer IB2. For a short COF, the target
of the COF resides in any one of the IB buffers IB1-IB3,
either forward or backward relative to the COF.

2.2. Buffer Control

Referring to FIGS. 3 and 4a, the IB prefetch buffer 242 is
controlled by the pf_dpath and pf_adctl blocks 302 and
306: (a) pf_adctl 306 controls buffer loading during
prefetch operations, and buffer unloading for decode
operations, implementing a virtual buffer management
scheme as described in Section 4, while (b) pf_dpath 302,
and in particular index pointer control logic 314, controls the
mux/align logic 316 to select the appropriate 8 bytes for the
next transfer to the Decoder.

Mux/align logic 316 receives instruction bytes from the
three IB prefetch buffers IB1-IB3, and the two data buses
C_IDATA and C_DATA, and multiplexes and aligns this
data for delivery to ID 211 over an 8-byte ib_bus [63:0].
The initial stage of this logic is a 5:1 multiplexer 322 which
selects the source of the instruction data to be transferred for
decoding; source selection signals are provided by
pf_adctl control 306 to select (a) one of the three IB
prefetch buffers, (b) one of the two data buses, or (c) from
a combination of these sources.

Aligner 324 aligns and selects the instruction bytes for the
next 8 byte transfer to ID 211. Pointer control 314 outputs
an index pointer to aligner 324 designating the index (initial)
byte of the next 8 bytes of instruction data to be transferred.
Thus, in the case of transfers from the IB prefetch buffers,
pf_adctl control 306 determines which of the three prefetch
buffers IB0, IB1, IB2 is current (based on the virtual buffer
management scheme discussed in Section 4), and pointer
control 314 indicates which byte within this prefetch block
is the index byte.

An 8-byte latch 326 latches the 8 byte transfer, and holds
the data valid on the ib_bus to ID 211. Valid bits for each
byte lane are generated and delivered along with the instruction
bytes.

When executing sequentially, for each 8 byte transfer,
PFU 240 calculates the incrementation of the pointer value
based on the number of bytes used in decoding the current
instruction (1-8). The Decoder provides a bytes-used value
via the id_shift lines to pointer control 314 to set up for the
next 8 byte transfer.

If a COF is encountered, the target address and a pointer
for the target address within an associated 16-byte prefetch
block (cache line) are supplied to the PFU; the lower bits
[3:0] of the target address constitute a pointer for the target
address within a 16-byte prefetch block (i.e., identifying its
location within the prefetched cache line). The target address
is obtained from either (a) BU 250 over the pf_idpip bus, or
if the BU access misses, (b) LDST 260 over C_ADDR (the
target address is supplied to the LDST unit by AC 212). For
short COFs that hit in the BU, the BU supplies an IB buffer
tag identifying the IB buffer in which the target address is
already located along with the pointer (see, Section 4.3).

If both the predicted and not-predicted paths of a branch
are staged in the prefetch buffer 242 (as is the case for
branches that miss in the BTC), index pointer control 314
maintains a pointer for each path.

Referring to FIGS. 4a and 4b, each clock the PFU 240
provides to the BU the physical IP (instruction pointer) for
the initial instruction byte of the current 8 byte transfer,
which is formed by the prefetch address tag for the current
IB buffer (i.e., the current prefetch block) together with the
4-bit (one of 16) current initial byte pointer maintained by
pointer control logic 314 in the pf_dpath block 302. When
ID 211 completes decoding an instruction, it signals the Pipe
Controller (217 in FIG. 2a) which in turn signals the BU; in
response, the BU will latch the next IP from the PFU as the
physical address for the initial byte of the next instruction to
be decoded, and use this IP address for accessing the BTC
and RSTK.

Referring to FIGS. 2 and 4a, the PFU IP address is
supplied to BU 250 via the pf_idpip lines for BTC lookup.
If a BTC hit occurs, the BTC supplies the prefetch address
(see, Section 3.2).

2.3. Prefetch Addressing

Referring to FIGS. 3 and 4a, prefetch addresses can be
generated by PFU 240 or LDST unit 260. The PFU generates
prefetch addresses for sequential prefetching, and for COFs
detected by BU 250 (BTC or RSTK hit). The LDST unit
generates prefetch addresses for COFs that are not detected
by the BU 250.

In response to a prefetch request, the prefetch buffer 242
can be loaded with data from two sources: the dedicated
instruction bus C_IDATA [127:0] or C_DATA [31:0].
When a prefetch request hits in the Cache 204, it supplies the 
16-byte cache line over the C_IDATA bus in a single 
128-bit transfer. If the prefetch misses in the Cache, the bus 
controller runs an external cache line (burst) fill cycle, and 
returns the 16 instruction bytes (4 dwords) over the 
C_DATA bus to both the PFU and the Cache. 
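
As a rough illustration of the two return paths, the sketch below models a prefetch block arriving either as one 128-bit transfer or as four 32-bit dwords. The function names, and the little-endian, ascending-address order of the four dwords, are assumptions made for the example; the text does not state the burst order.

```python
LINE_BYTES = 16  # one prefetch block (cache line)


def block_from_idata(value_128):
    """Cache hit: a single 128-bit transfer over C_IDATA carries the whole line."""
    return value_128.to_bytes(LINE_BYTES, "little")


def block_from_data(dwords):
    """Cache miss: four 32-bit transfers over C_DATA (burst fill) build the line."""
    if len(dwords) != 4:
        raise ValueError("a line fill is a 4-dword burst")
    # Assumed order: dword 0 holds the lowest-addressed four bytes.
    return b"".join(d.to_bytes(4, "little") for d in dwords)
```

Either path yields the same 16-byte block, which is why the prefetch buffer can be loaded from whichever bus returns the data.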

Data from two distinct prefetches can be loaded simulta- 
neously into two of the three IB buffers as long as they return 
data over different buses. For branches, this feature allows 
both the predicted and not-predicted paths to be fetched and 
buffered simultaneously. 

In the PFU, prefetching is controlled by the pf_apath and 
pf_adctl blocks 304 and 306. The pf_apath block 304 is 
basically a 32-bit datapath that (a) for sequential PFU 
prefetching, issues prefetch addresses over the C_ADDR 
bus to both the Cache 204 and Bus Controller 208, and (b) 
for prefetches initiated by LDST 260, receives addresses 
from that same bus to perform comparisons and update 
registers. Pf_apath also (a) submits physical IP addresses 
for the instruction currently being decoded to BU 250 over 
the pf_ifa bus for BTC lookup, and (b) for BTC/RSTK hits, 
receives prefetch target addresses/information back from the 
BU over the same bus. 

Both the pf_ifa and the C_ADDR buses transfer physical 
addresses. The pf_apath block 304 does maintain linear 
addresses for segment limit checking and other various 
functions; these linear addresses can be delivered to the
TLB 225 via the pf_labus for translation, and can be
updated from AC 212 via the ac_labus.

2.3.1. Sequential Prefetching 

For sequential prefetching, pf_apath 304 performs five 
basic functions: (a) prefetch sequencing, (b) physical 
prefetch address request, (c) IB prefetch address tag 
comparison, (d) linear prefetch address maintenance, and (e) 
C_ADDR interface. 

Pf_apath 304 generates prefetch requests by providing a 
physical address to C_ADDR, along with associated 
attributes. The pf_apath block drives prefetch addresses 
onto C_ADDR— control and attribute information is driven 
onto the C-BUS by the pf_adctl block 306. 

Pf_apath 304 includes a physical prefetch request latch 
that holds the next physical prefetch address that will be 
placed on C_ADDR if no COFs are encountered in either
the current or the next IB prefetch block. For sequential 
prefetching, an incrementer adds 16 to the contents of this 
latch each time a fetch is sent out. For a COF, the physical 
target address of the COF (from BU 250 or LDST 260) is 
muxed into the latch, and incremented in preparation for the
next sequential prefetch. 
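
The latch update just described amounts to simple address arithmetic. A minimal sketch, assuming 32-bit physical addresses and hypothetical function names; aligning a COF target down to its 16-byte block before incrementing is inferred from the 16-byte prefetch block size rather than stated in this paragraph.

```python
BLOCK = 16           # bytes per prefetch block
MASK32 = 0xFFFFFFFF  # assumed 32-bit physical address space


def next_sequential(latch):
    """Incrementer: add 16 to the latch each time a fetch is sent out."""
    return (latch + BLOCK) & MASK32


def after_cof(target):
    """COF target is muxed in, then stepped to the following aligned block."""
    return ((target & ~(BLOCK - 1)) + BLOCK) & MASK32
```

For example, after a COF to target 0x2347 the latch would hold 0x2350, the address of the block following the one that contains the target.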

Because it is possible for two instruction fetches to be 
simultaneously outstanding, pf_apath 304 includes a second 
prefetch request latch to hold the second outstanding 
prefetch address. Both latches are loaded from the pf_ifa
bus, or in the case of a prefetch request initiated by LDST
260, from the C-BUS (i.e., the PFU sequences instruction
fetches issued from the LDST). The two latches are operated 
as a 2-deep queue. 
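
The pair of request latches behaves like a two-entry FIFO. The sketch below is a software analogy only; the class and method names are hypothetical, and in-order completion is an assumption implied by the word "queue" rather than a stated property.

```python
from collections import deque


class OutstandingPrefetches:
    """Hypothetical model of the two prefetch request latches as a 2-deep FIFO."""

    def __init__(self):
        self._q = deque()

    def issue(self, addr):
        """Record an issued prefetch; False means both latches are occupied."""
        if len(self._q) == 2:
            return False
        self._q.append(addr)
        return True

    def complete(self):
        """Assumed in-order: the oldest outstanding request is satisfied first."""
        return self._q.popleft()
```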

Pf_apath 304 includes a retry latch to buffer the current 
prefetch address (from either the PFU or LDST) — a retry 
signal may be received late, after the source of the prefetch 
address is already corrupted. The retry latch is updated either 
from pf_ifa or from the C-BUS as a request is issued.

Pf_apath 304 is also responsible for signaling when a
segment boundary has been reached by sequential prefetch, 
as well as when sequential prefetching encounters a page 
boundary (see, Section 5). 
The pf_apath block also includes logic to detect
self-modifying code (see, Section 6).
2.3.2. COFs

Referring to FIGS. 3 and 4a, the PFU 240 provides
physical prefetch (IP) addresses to the BU 250 over the 
pf_idpip bus for BTC/RSTK lookup. The pf_apath block 
provides the physical IP for the instruction currently being 
decoded, i.e., the instruction for which the pf_dpath block 
302 is currently delivering instruction bytes to ID 211. 
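
Section 2.2 describes this physical IP as the 28-bit prefetch address tag of the current IB buffer combined with the 4-bit initial byte pointer. A minimal sketch of that composition follows; the function names are hypothetical.

```python
def physical_ip(address_tag, byte_pointer):
    """Combine the 28-bit tag (address bits [31:4]) with the 4-bit byte pointer."""
    if not 0 <= address_tag < (1 << 28) or not 0 <= byte_pointer < 16:
        raise ValueError("tag is 28 bits, pointer is 4 bits")
    return (address_tag << 4) | byte_pointer


def split_ip(ip):
    """Inverse: recover (tag, pointer) from a 32-bit physical address."""
    return ip >> 4, ip & 0xF
```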

For BTC hits, the BU 250 drives the target address 
directly onto the pf_ifa bus; the target address is gated by
the pf_adctl block 306 onto the C_ADDR bus as a prefetch
request unless a BTC hit results from a short COF (i.e., a
COF in which the target has already been prefetched into the
prefetch buffer 242) in which case no prefetch request is
generated (see, Section 4.3). Target addresses supplied by
the BU are latched in a target address latch in pf_apath
304; because the PFU begins prefetching at the target
address as the instruction prior to the COF is decoding, it is
possible that the current value in the physical prefetch
address latch will have to be incremented for another
prefetch to complete decoding such prior instruction and/or
to fetch the COF instruction. Thus, in FIGS. 5a-5c, if the
PFU IP for the index instruction N-1 in the current IB buffer
IB2 results in a BTC hit such that the BU supplies the target
prefetch address, the next prefetch block will nonetheless
have to be fetched into IB buffer IB3 to complete decoding
the instruction N-1 and/or to decode the COF. When the
COF decodes, the target address (which has already been 
used either for prefetch or short COF detection) is incre- 
mented to the next memory aligned prefetch address and 
stored in the physical prefetch address latch in preparation 
for the next sequential prefetch request. 

For COFs that miss in the BU, after the COF is decoded 
by ID 211, AC 212 calculates the target linear address which 
is input to the TLB 230 — depending on whether paging is 
enabled, the TLB supplies a translated or untranslated target 
physical address to the LDST unit which generates a 
prefetch request. Recall that, for conditional COFs, the 
default prediction is (a) for branches, not-taken, and (b) for
LOOPs, taken— in either case, LDST 260 generates a 
prefetch request for the target address which for branches 
will be the not-predicted path, and for LOOPs will be the 
predicted path. The PFU will buffer the not-predicted path 
which, for branches, will be the taken path prefetched by the 
LDST unit, and for LOOPs, will be the not-taken fall 
through path (see, Section 4.4). 
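
The default (static) prediction rules for COFs that miss in the BU can be summarized in a few lines. This is a sketch with hypothetical names; it only restates which address is treated as predicted, which as not-predicted, and which one the LDST unit prefetches.

```python
def static_paths(kind, target, fall_through):
    """Return (predicted, not_predicted, ldst_prefetch) for a COF that misses in the BU."""
    if kind == "branch":    # conditional jump: statically predicted not-taken
        predicted, not_predicted = fall_through, target
    elif kind == "loop":    # LOOP: statically predicted taken
        predicted, not_predicted = target, fall_through
    else:
        raise ValueError("expected 'branch' or 'loop'")
    # In either case the LDST unit generates the prefetch for the target address.
    return predicted, not_predicted, target
```

For a branch the LDST prefetch is therefore the not-predicted path, while for a LOOP it is the predicted path.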

For prefetch addresses generated by LDST 260, AC 212 
supplies the linear address to the LDST unit in AC1, which 
then initiates the prefetch request to the Cache — the linear
address is used for set selection, with the translated physical 
address being available from TLB 230 early in AC2 for tag 
comparison and hit/miss determination. If the prefetch 
request hits, the Cache output will be placed on C_IDATA 
at the end of AC2 (i.e., one clock after the COF is decoded);
the pf_ifa bus is bypassed when LDST 260 generates the
prefetch request. 
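
The split between set selection with the linear address and tag comparison with the physical address can be sketched as follows. The bit ranges (index [11:4], tag [31:12]) are those given in Section 3.1 for the 256-set, 4-way, 16-byte-line cache; the function names and the list-of-lists tag array are assumptions made for the example.

```python
def set_index(linear_addr):
    """AC1: set selection uses bits [11:4] of the untranslated address."""
    return (linear_addr >> 4) & 0xFF


def tag_of(physical_addr):
    """AC2: tag comparison uses bits [31:12] of the translated address."""
    return (physical_addr >> 12) & 0xFFFFF


def lookup(tags, linear_addr, physical_addr):
    """tags[set][way] holds a stored tag or None; return the hit way or None."""
    ways = tags[set_index(linear_addr)]
    for way, stored in enumerate(ways):
        if stored is not None and stored == tag_of(physical_addr):
            return way
    return None
```

Because bits [11:4] fall inside the 4K byte page offset, they are identical in the linear and physical address, which is what allows set selection to start before translation completes.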

If the BU 250 detects that a COF is mispredicted (either
statically or dynamically), then the BU will drive the not- 
predicted address, which may be the taken or not-taken 
(fall-through) address, onto the pf_ifa bus. For each COF,
the BU stores both the predicted (statically or dynamically) 
and not-predicted addresses in a branch resolution buffer 
(see, Section 3.5). 
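
A minimal sketch of the repair step, assuming a hypothetical BrbEntry record that holds the two addresses named above. The actual branch resolution buffer is described in Section 3.5 and holds more state than is shown here.

```python
from dataclasses import dataclass


@dataclass
class BrbEntry:
    predicted: int      # address the prefetcher followed
    not_predicted: int  # alternate address kept for repair


def resolve(entry, prediction_correct):
    """Return the address to redirect prefetching to, or None if no repair is needed."""
    if prediction_correct:
        return None
    return entry.not_predicted  # driven back to the PFU on a mispredict
```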
3. Branch Unit 

FIG. 6a illustrates the Branch Unit (BU) 250, which 
includes the branch target cache (BTC) 252, a return stack 
(RSTK) 342, and a branch resolution buffer (BRB) 344. The
BU stores target information used for prefetching target
addresses both (a) for conditional COFs (branches), i.e., JCC
and LOOPs, and (b) for unconditional COFs (UCOFs), i.e.,
JMPs (jump) and CALLs/RETurns.

BTC 252 and RSTK 342 store target address information
for recently used COFs. The BTC stores target address
information for both conditional and unconditional COFs,
except RETurns (see, Section 3.1). The RSTK 342 stores the
physical target address for RETurn instructions (see, Section
3.3).

UCOFs are always taken (in effect, a static prediction).
When a JMP or CALL that misses in the BTC is decoded,
prefetching switches to the target direction, and an entry is
allocated in the branch resolution buffer. The UCOF is
written into the BTC if/when it reaches the EX stage.
RETurns are handled by the RSTK.

For conditional COFs, the BU dynamically predicts the
COF direction for BTC hits; COFs that miss in the BTC
252 are statically predicted. Regarding conditional COFs
that miss in the BTC, LOOP instructions are statically
predicted taken and prefetching switches to the target path
(identified in the AC stage), while branches are statically
predicted not-taken. In either case, LDST 260 commences
prefetching in the taken direction once the target address is
available from the AC stage: for LOOPs, the LDST unit
prefetches the statically predicted taken direction (the PFU
240 either has or will prefetch one prefetch block along the
not-predicted not-taken direction), while for branches, the
LDST unit prefetches the statically not-predicted taken
direction. If/when these instructions reach the EX stage,
conditional COFs are allocated an entry in the BTC only if
they resolve taken.

For conditional COFs that hit in the BTC 252, the
predicted path depends on the history information stored
along with the target information for that entry. Thus, once
a conditional COF is allocated into the BTC, it will remain
in the BTC until replaced even if its predicted path changes
from the statically predicted path based on its history.

When a COF is detected, either by a BTC hit or by the
Decoder for a BTC miss or a RETurn, an entry is allocated
in BRB 344 (see, Section 3.5). In particular, for COFs
predicted by the BU, the BRB 344 is used to resolve both the
target address (available in the AC stage) and the target
direction (available in the EX stage), and to repair any
mispredictions.

Far COFs are never cached in the Branch unit (BTC or
RSTK), because these require a segment load from memory.

3.1. Target Information

For the exemplary BTC, the target information used for
prefetching target prefetch blocks containing target instructions
is the L1 Cache index and way number that together
define a particular cache location (containing a prefetch
block cache line), rather than the actual target address (see,
Section 3.1.1). In effect, the exemplary design assumes that
the prefetch request issued in response to a BTC hit will hit
in the L1 Cache.

FIG. 6b illustrates the organization of the cache array for
the L1 Cache 204 (the tag array is similarly organized). The
[...] physical (translated) address determines hit/miss, and for
a hit selects 1 of 4 ways as the cache location output. The output
cache line forms the prefetch block of instruction bytes
returned to the Prefetch unit.

For PFU initiated prefetch requests, the PFU supplies the
physical address [31:4] (L1 Cache index [11:4] and tag
[31:12]) during IF. For LDST initiated prefetch requests, the
LDST unit supplies the L1 Cache index bits [11:4] from the
linear address available in AC1, and then supplies the tag
bits [31:12] of the translated address available in AC2.

3.1.1. Storing Cache Indices

FIG. 6c illustrates the BTC 252, and in particular the BTC
entries. Each BTC entry includes bits [11:0] of the target
address comprising (a) a set number [11:4] formed by the L1
Cache Index, and (b) a byte location number [3:0]. The L1
cache index [11:4], together with a 2 bit way number,
identify a particular cache location (set and way) assumed to
store the prefetch block containing target instruction; the
byte number identifies 1 of 16 bytes that is the initial target
instruction byte.

Finally, each entry includes a valid bit, a three-bit
IB_LOC field that designates the IB buffer location for short
COFs, branch history bits, and two attribute bits PWT (page
write through) and PCD (page cache disable).

In particular, the BTC stores only the L1 Cache index
[11:4] portion of the target address, but not the corresponding
cache tag [31:12] portion. A prefetch access with the L1
Cache index will enable set selection to be performed, but
the cache tag will not be available for tag comparison to
select the way in which the cache line containing the target
instruction is stored; instead, the way number cached in the
BTC along with the L1 Cache index is used for way
selection, allowing the L1 Cache to return a cache line in
response to the prefetch request. When the COF that resulted
in the BTC hit is executed, the Branch unit will determine
whether the prefetch request resulting from the BTC hit was
successful in retrieving the target instruction (see, Section
3.2.1).

Storing L1 Cache indices in the BTC, rather than the full
target address including the cache tags, has a number of
advantages, including reducing BTC die area. Regarding die
area, caching L1 Cache indices requires 14 bits (bits [11:0]
plus the 2 bit way number) versus 32 bits for the full target
address, for an area reduction of 18x128 bits.

Note that the portion of the target address stored in the
exemplary BTC is the low order 12 bits [11:0] that are the
same for both the linear address and translated physical
address. In other words, the BTC does not contain any
portion of the upper 20 bits of either the linear or physical
address, and in particular, does not supply the target linear
address for use in segment limit checking (see, Section 5).

Each BTC entry has associated with it a 7 bit tag. The
BTC is accessed with the PFU IP bits [11:0], with bits
IP[4:0] selecting 1 of 32 entries from each of the 4 ways, and
bits IP[11:5] providing the 7 bit tag that is compared to each
selected entry to determine hit or miss.

Because the exemplary BTC organization uses a 5-bit
index and a 7-bit tag, aliasing is possible between instructions
with IPs that are identical for bits [11:0] but different

LI Cache array is 16K 4-way set associative with 256 sets 60 in the upper 20 bits. Thus, a BTC hit may result from the IP 

and 4 ways per set defining 256 X 4 set/way locations each for an instruction that aliases with an the IP for an instruction 

storing a cache line of 16 bytes (4 dwords). that is prior to a COF — a BTC non-COF alias will be 

For prefetch addresses supplied by the PFU or LDST units detected when no COF is decoded. Allowing aliasing rep- 

(i.e., not the BTC), the LI Cache is accessed with a prefetch resents a design trade-off to reduce BTC. 

address [31:4] (the lower bits are ignored. An 8 bit LI Cache 65 3.1 .2. Short COFs 

index [11:4] selects 1 of 256 sets with 4 ways (cache The Prefetch unit and Branch unit cooperate to deteel a 

locations)— lag comparison using bits [31:12] of the physi- short COF condition in which the target instruction is 
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already in one of the three IB buffer blocks (see, FIG. 5c). 
By detecting a short COF, the Prefetch unit is able to switch 
to the target instruction without issuing a prefetch request for 
the prefetch block (cache line) containing the target. 

FIG. 7 illustrates the detection of a short COF condition,
including the setting of the IB_LOC field in a corresponding
BTC entry. The IB_LOC field (FIG. 6c) is a 3 bit
one-hot field that selects one of the three IB buffers according
to the logical allocations CURR/NEXT/PREV (see,
Section 4.3).

When a COF is first encountered, it misses in the BTC and
is detected during decode. If the COF is resolved taken in the
EX stage, an entry for the COF is allocated in the BTC. That
entry includes the L1 Cache index for the target instruction
but as yet the IB_LOC field is not valid.

The second time the COF is encountered, it hits in the
BTC 252, which outputs target prefetch information (L1
Cache index and way number). In response to the prefetch
request, the L1 Cache 204 returns the prefetch block containing
the target (assuming a cache hit on a cache location
containing the target), along with the corresponding cache
tag.

The branch resolution buffer 344 receives the low order
bits [11:0] of the target address from the BTC 252 (L1 Cache
index and byte number), and the high order bits [31:12] of
the target address from the L1 Cache 204 (cache tag
address). The target address is compared (351) with the
prefetch address tags for each of the IB buffers IB1-IB3. If
a valid IB buffer prefetch address tag matches (352) the
target address (indicating that the prefetch block containing
the target is already in one of the IB buffers), then a
short COF condition is detected.
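
The comparison described above can be sketched as follows. This is only an illustrative model, assuming each IB buffer carries a valid bit and a prefetch address tag identifying a 16-byte block; the names `BLOCK_MASK` and `find_short_cof` and the dictionary layout are hypothetical and are not taken from the exemplary design.

```python
from typing import Dict, Optional, Tuple

# Bits [31:4] identify a 16-byte prefetch block; bits [3:0] are the byte offset.
BLOCK_MASK = 0xFFFFFFF0


def find_short_cof(target: int,
                   ib_tags: Dict[str, Tuple[bool, int]]) -> Optional[str]:
    """Return the logical IB buffer (CURR/NEXT/PREV) whose valid prefetch
    address tag matches the block containing `target`, or None."""
    for logical_name, (valid, block_addr) in ib_tags.items():
        if valid and (block_addr & BLOCK_MASK) == (target & BLOCK_MASK):
            return logical_name
    return None


# Hypothetical buffer state: PREV holds a stale (invalid) copy of the block.
ib_state = {
    "CURR": (True, 0x12345AA0),
    "NEXT": (True, 0x12345AB0),
    "PREV": (False, 0x12345AB0),
}
```

With this state, a target of `0x12345ABC` falls in the block held by NEXT, so a short COF would be detected; a target in any other block yields no match and would be handled as a normal COF.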

When a short COF is detected, the corresponding
BTC entry is updated (353) with the logical CURR/PREV/
NEXT IB buffer location. When the COF is encountered the
third time, the correct IB buffer is used to source the target
instruction, and a prefetch request is inhibited.

Note that, while the exemplary IB_LOC field designates
the logical IB buffer CURR/NEXT/PREV containing the
target, the Branch unit could supply merely an indication
that the target is in the prefetch buffer, and the PFU could
determine which IB buffer contains the target.

Avoiding unnecessary prefetch requests has a number of
advantages, including saving power and enhancing performance
because the C-BUS and L1 Cache are not accessed
for the target prefetch. In particular, eliminating some
prefetch requests allows the C-BUS bandwidth that would
otherwise have been used by PFU 240 for the prefetch
requests to be used for other purposes by the L1 Cache 204,
the BC 208, or the LDST 260.

3.2. BTC Access 

Referring to FIGS. 3 and 6a, the BTC is accessed with the
PFU IP address during PH1 of the first clock of the Decode
stage for the instruction prior to the COF, i.e., the BTC is
accessed at the same time the PFU begins transferring to the
Decoder instruction bytes for the instruction prior to the
COF. The BTC determines hit or miss by the end of PH1.

3.2.1. BTC Hit 

If the BTC access hits, the BTC signals hit and outputs the
L1 Cache index [11:4] (and associated 2 bit way number) for
the target address, which is latched into the branch resolution
buffer (BRB) 344. The L1 Cache index is gated onto the
C-BUS by the pf_cbusctl block 308 in PFU 240, initiating
a prefetch request to obtain the cache line that includes the
target instruction.

The L1 Cache performs a cache lookup using the L1
Cache index [11:4] for set selection, and outputs a target
cache line from the way location specified by the way
number. In addition, the L1 tag array is accessed to obtain
the corresponding cache tag [31:12] for the target cache line.
Both the target cache line and the cache tag are returned to
the PFU/BU.

The cache tag is copied into the BRB 344, and combined
with the L1 Cache index output from the BTC, such that the
BRB stores a complete physical address [31:0] as a speculative
target address. When the COF reaches the AC stage,
the target linear address is calculated in AC1 and translated
in AC2 to obtain the actual target physical address. The
actual target physical address from AC is compared to the
speculative target address stored in the BRB — a mismatch
may occur if either (a) the actual target cache line was
replaced in the L1 Cache, or (b) the actual target address
was modified.
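
As a rough illustration of how the speculative target address is assembled and checked, the sketch below concatenates the cache tag [31:12], the L1 Cache index [11:4], and the byte number [3:0], and then compares the result with the actual physical address. The function names are hypothetical; only the bit-field boundaries come from the description.

```python
def speculative_target(cache_tag: int, l1_index: int, byte_number: int) -> int:
    """Assemble a 32-bit physical address from the three fields:
    tag -> bits [31:12], index -> bits [11:4], byte number -> bits [3:0]."""
    return (((cache_tag & 0xFFFFF) << 12)
            | ((l1_index & 0xFF) << 4)
            | (byte_number & 0xF))


def target_mismatch(speculative: int, actual: int) -> bool:
    """True when the address produced in the AC stage differs from the
    speculative address held in the BRB."""
    return speculative != actual


spec = speculative_target(0x12345, 0xAB, 0xC)
```

Here `spec` equals `0x12345ABC`; if the cache line at that set/way had been replaced, the returned tag would differ and the comparison against the actual target address would flag a mismatch.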

If the actual target address and the BRB address
mismatch, and the COF was predicted taken and resolves
taken in the EX stage, then the corresponding BTC entry is
invalidated during the WB stage, and a new target address is
sent to the PFU (resulting in a 5 cycle target address
mismatch penalty). If no mismatch occurs, indicating the
prefetch resulting from the BTC hit correctly retrieved the
target cache line, the BU will update the prediction history
bits, the IB buffer location bits, and LRU bits during the EX
stage of the COF.

Note that the prefetch request is generated before the
instruction prior to the COF completes decoding. As a result,
if the prefetch request hits in the L1 Cache, the target cache
line may be returned prior to the COF decoding, requiring
the PFU 240 to allocate an IB buffer for the target cache line
(see, Section 4.2).

FIG. 8a illustrates the timing in the BTC for BTC hits. 

3.2.2. BTC Miss

Referring to FIG. 6a, if a COF misses in the BTC, then an 
entry in the BTC will be allocated (a) for UCOFs, and (b) for 
conditional COFs (branches or LOOPs) that resolve taken. 

The COF will be decoded in the ID stage, and an entry 
allocated in the BRB. The target address is calculated during 
AC stage. If the COF resolves taken in the EX stage, and if 
the COF is cacheable, the target information is written into 
the BTC during the WB stage. The prediction history bits are 
set to weak taken state 3, predicting taken for the next 
iteration, the instruction buffer location bits are set to 0, and 
the LRU bits for the set are updated. 

If allocating the COF into the BTC requires replacing 
another entry, a pseudo-LRU algorithm is used to select the 
way for allocation/replacement in the selected set. If an 
invalid way exists, it will be chosen for allocation, otherwise 
if a predicted not-taken entry exists it will be chosen for 
replacement. Finally if all four ways in the set are valid and 
predicted taken, the least recently used entry will be 
replaced. 
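
The allocation priority just described (invalid way first, then a predicted not-taken entry, then least recently used) can be sketched as below. This is a simplified model under stated assumptions: the `age` field is a hypothetical stand-in for the pseudo-LRU bits, and history states 0 and 1 are taken to mean predicted not-taken, as in the branch prediction scheme of Section 3.4.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BtcEntry:
    valid: bool
    history: int  # 2-bit prediction state 0..3; states 2 and 3 predict taken
    age: int      # hypothetical LRU stand-in: larger means less recently used


def choose_way(ways: List[BtcEntry]) -> int:
    """Pick the way to allocate or replace within one selected BTC set."""
    # 1. Prefer any invalid way.
    for i, entry in enumerate(ways):
        if not entry.valid:
            return i
    # 2. Otherwise replace an entry that currently predicts not-taken.
    for i, entry in enumerate(ways):
        if entry.history < 2:
            return i
    # 3. All ways valid and predicted taken: replace the least recently used.
    return max(range(len(ways)), key=lambda i: ways[i].age)
```

A real pseudo-LRU implementation would track a few bits per set rather than a per-entry age; the ordering of the three rules is the point of the sketch.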

FIG. 8b illustrates the timing in the BTC for BTC writes.

3.3. Return Stack

FIG. 9 illustrates the return address stack (RSTK) 342.
The RSTK holds the predicted target addresses for RETurn 
instructions. 

Return addresses are pushed onto the stack when a CALL
is decoded (whether or not the CALL hit in the BTC), with
the return address being supplied over pf_idpip. Predicted
return addresses are popped off the stack when the RETurn
instruction is decoded, and output onto pf_ifa for use in
prefetching.
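
The push-on-CALL, pop-on-RETurn behavior can be modeled with a small bounded stack. The sketch below is illustrative only: the default depth of 8 matches the exemplary RSTK, but the overflow policy (silently dropping the oldest entry) and the class and method names are assumptions, since the text does not specify them.

```python
from collections import deque
from typing import Optional


class ReturnStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, depth: int = 8) -> None:
        # deque(maxlen=...) discards the oldest entry when full (assumed policy).
        self._entries = deque(maxlen=depth)

    def push(self, return_address: int) -> None:
        """Called when a CALL is decoded."""
        self._entries.append(return_address)

    def pop(self) -> Optional[int]:
        """Called when a RETurn is decoded; None means no prediction."""
        return self._entries.pop() if self._entries else None
```

A nested CALL sequence pushes addresses in order and the matching RETurns pop them in reverse, so the innermost return address is predicted first.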

The exemplary RSTK 342 holds 8 entries. Each entry is
a 32 bit physical return address — because the RSTK only
holds 8 entries, storing actual return target addresses rather
than L1 Cache indices as in the BTC does not result in a
significant die area penalty.

Note that the use of the return address stack permits the
Branch unit to supply the return target address the first time
the RETurn is decoded, even though the associated CALL
will not be entered into the BTC. The second time the CALL
is encountered, it will hit in the BTC which will supply the
CALL target prefetch information from which a prefetch
request for the CALL target will be generated. Operation of
the RSTK is the same regardless of whether the corresponding
CALL hits in the BTC.

3.4. Branch Prediction

FIG. 10 illustrates the exemplary branch prediction strategy.
Branch (and LOOP) prediction uses two history bits
stored with each branch. The two bits encode the states 0
through 3 as shown.

States 0 and 1 predict a branch not taken, and states 2 and
3 predict a branch taken. The transitions between the states
are determined by the actual behavior of the branch.

When an entry is first cached in the BTC, its prediction
history bits are set to state 3 predicting taken on the next
iteration.

3.5. Branch Resolution Buffer

FIG. 11 illustrates the organization of the branch resolution
buffer (BRB) 344. The BRB is used to track change of
flow instructions through the execution pipeline, and to
resolve the target address and, for conditional COFs, the
target direction. In particular, conditional COFs are resolved
based on the condition code results available during EX.

BRB entries are allocated either in response to a BTC hit,
or in the case of a BTC miss, in response to the decode of
the COF. Because COF direction is not resolved until the EX
stage, it is possible that one COF could be resolving in the
EX stage, while a second COF is in the AC2 stage, a third
COF is in the AC1 stage, and a fourth COF is in the ID stage.
Therefore, the BRB requires four entries to cover all cases
of COF instructions in the pipeline.

The Entry field in the BRB indicates which entry in the
BTC the COF instruction hit in so that the entry may be
updated without having to access the BTC. The entry is
formed by the IP bits [4:0] which index into the BTC plus
the BTC way number to select a particular entry.

The Target Address field holds the physical target address
for a COF that hit in the BTC. This value is retrieved from
the cache tags returned by the L1 Cache in response to a
prefetch request. This value is compared against the actual
target address (from AC) before the COF is resolved.

The Not Taken Address field holds the address of the next
instruction following the COF. This value is sent to the PFU
when a COF resolves not taken.

4. Virtual Buffer Management

Referring to FIGS. 4a and 4b, PFU 240 employs a virtual
buffer management scheme to control the allocation of
prefetched instruction bytes into the three prefetch buffers
IB0, IB1, and IB2. Each of the IB prefetch buffers holds one
16-byte, memory-aligned prefetch block (cache line) of
instruction bytes (with separate valid bits for each 4-byte
dword in the block).

The physical IB prefetch buffers are logically allocated as
CURR, NEXT, and PREV buffers (or blocks), where: (a)
CURR is allocated to buffer the prefetch block from which
instruction bytes are currently being transferred to the
decoder, (b) NEXT is allocated to buffer the prefetch block
that will next be reallocated as CURR (assuming sequential
prefetching), and (c) PREV is allocated to the deallocated
CURR block.

In addition, in the case of a COF (other than a short COF),
two temporary tags may be assigned to the PREV IB buffer:
(a) for BTC hits on COFs dynamically predicted taken,
TARG is assigned when the BTC hits designating the PREV
IB buffer to receive the target block prefetched by the
PFU/BU, or (b) for BTC misses, N_PRED is assigned when
the COF decodes to designate the PREV IB buffer to store
the not-predicted path. Regarding N_PRED (a) for LOOPs
statically predicted taken, the CURR IB buffer, which holds
the not-taken address, will be reallocated to PREV and then
tagged N_PRED, and (b) for branches statically predicted
not taken, the PREV IB buffer will be tagged N_PRED to
receive the not-predicted target block prefetched by
LDST — the N_PRED tagged IB buffer will be allocated as
CURR if the COF resolves mispredicted.

The virtual buffer management scheme avoids any need to
physically transfer between buffers.

The pf_adctl logic 306 implements the virtual buffer
management scheme using internally maintained IB tags CURR, NEXT,
and PREV, as well as [illegible].
pf_adctl maps the IB tags to a respective physical IB
prefetch block IB1-IB3, generating control signaling to the
pf_dpath logic 302. Each IB tag comprises a bit vector
indicating a respective IB buffer IB1, IB2, or IB3.

4.1. Sequential Prefetch

Referring to FIGS. 3 and 4a, sequential prefetching
occurs unless the code stream is interrupted by: (a) a COF
or exception, (b) self-modifying code, or (c) a segment limit
or page boundary. In sequential prefetching, the three IB
buffers are operated as a circular queue.

As instruction bytes are transferred from the CURR IB
buffer to ID 211 for decoding, the byte pointer from pointer
control 314 increments in response to the id_shift signal
until the CURR buffer is exhausted. At that point, pf_adctl
306 uses the virtual buffer management scheme to logically
reallocate the IB buffers — the IB tags are remapped such that

    CURR→PREV
    NEXT→CURR

while the next prefetch block is fetched into

    PREV→NEXT

That is, the new NEXT IB buffer (previously the PREV IB
buffer) is cleared, and the PFU outputs a prefetch request for
the next sequential prefetch block.

Thus, the PFU attempts to prefetch at least one 16-byte
prefetch block ahead of the instruction bytes being
transferred out of CURR to the Decoder.

Normally, the returned prefetch block is directed into the
NEXT buffer. However, situations can occur in which two
IB buffers are empty when a prefetch block returns over the
C-BUS (for example, if the prefetch request missed in the L1
Cache and was directed off-chip in an external bus cycle). In
this case, the returning prefetch block will be loaded by
pf_adctl 306 into the new CURR IB buffer, and a new
prefetch request will be issued to load the new NEXT IB
buffer. Another possibility is that when a COF is
encountered, two fetches in the two different paths of the
COF — predicted not-taken and not-predicted taken — may
be pending.

To handle the situation in which two prefetch requests
may be pending and two IB buffers available, the pf_adctl
logic 306 maintains two more IB tags — HFILL and MFILL,
which function in a manner similar to the CURR, NEXT,
and PREV tags. The HFILL tag specifies where data coming
from a cache hit should be placed, and the MFILL tag
specifies where data coming from an external linefill should
be placed. When a prefetch request is generated, the HFILL
tag points to CURR or NEXT buffer, whichever one is to be
filled. If the L1 Cache signals a hit, the HFILL tag directs the
prefetch block from the C_IDATA bus to the designated
buffer — if the request misses in the L1 Cache, the PFU waits
until the first bus cycle of the external burst fill cycle, at
which time the MFILL tag is set to point to the IB buffer to
be filled.

The MFILL pointer stays constant for the duration of the
burst fill cycle, regardless of whether the queue shifts and a
second prefetch request is generated. This tag is then used to
control placement of data from the C_DATA bus into the
appropriate IB buffer.

If the Decoder is idle waiting for fetched instruction bytes,
the HFILL and MFILL tags are used to simultaneously route
the incoming prefetch block from C_IDATA or C_DATA
data straight through the aligner 117 to ID 211, at the same
time the block is loaded into the appropriate IB buffer.

4.2. Normal COFs

A normal COF is a COF that is not a short COF, i.e., a
COF for which the target address is not already in one of the
IB buffers, and must be fetched. Virtual buffer management
for normal COFs involves: (a) completing decode of the
COF, and (b) assigning the TARG or N_PRED tags to the
appropriate IB buffer to receive a prefetched predicted or
not-predicted target.

Referring to FIGS. 5a-5c, the CURR IB buffer IB2 contains all
or part of the N-1 instruction previous to the COF, which
partially or completely resides in NEXT IB buffer IB3. That
is, either the N-1 instruction or the COF are misaligned such
that both IB buffers are required to complete decode of the
COF, with the IB buffers being reallocated

    NEXT→CURR
    PREV→NEXT
    CURR→PREV

as the N-1 and COF instructions decode. Note that the
physical IB buffer logically allocated as PREV will be
assigned TARG or N_PRED either at BTC hit time or, for
BTC misses, at the time the COF decodes — if logical buffer
reallocation is required to complete COF decode, then the
physical IB buffer logically allocated as PREV will be
logically reallocated as NEXT, but the TARG or N_PRED
tag will continue to point to that physical IB buffer (i.e., the
new NEXT).

Referring to FIG. 3, when the N-1 instruction starts
decoding, the BTC 252 is accessed with the N-1 IP — if the
access hits, the PFU 240 will prefetch the target (unless the
COF is a conditional COF predicted not-taken), while if the
access misses, the LDST 260 will prefetch the target as
either the statically predicted or not predicted path of the
COF. That is, the target prefetch block will be prefetched
unless a BTC hit results in a dynamic not-taken prediction.

For the exemplary virtual buffer management scheme,
once the PREV IB buffer has been assigned either TARG or
N_PRED, the CURR IB buffer continues to deliver instruction
bytes to the decoder, and if necessary, the NEXT IB
buffer can then be reallocated CURR to continue sequential
prefetching so that

    NEXT→CURR
    CURR→PREV
    PREV (TARG/N_PRED)→NEXT

If the new CURR IB buffer is exhausted without completing
decode of the COF, then normal sequential prefetching
proceeds with

    NEXT (TARG/N_PRED)→CURR

In this case, the TARG/N_PRED tag is invalidated, the new
CURR buffer is cleared, and the Prefetch unit issues a
prefetch request for the next sequential prefetch block.

Alternatively, while a physical IB buffer is assigned either
TARG or N_PRED, the other two IB buffers may be
logically reallocated CURR and NEXT as a two block
circular queue.

4.2.1. BTC Miss

Referring to FIGS. 3 and 4a, for BTC misses, the target
address (available in AC2) will be prefetched by LDST
260 — for UCOFs and LOOPs, the LDST prefetches the
statically predicted taken path, while for branches, the LDST
prefetches the statically not-predicted taken path.

At decode time, pf_adctl 306 reallocates the IB tags for
the IB buffers IB1-IB3 to receive the target prefetch block.
pf_cbusctl [illegible] when the target prefetch is
aborted — for example, an exception occurring when
generating a target address will abort the fetch
from LDST.

For UCOFs, the IB buffers are simply cleared, and the
PFU stops prefetching to wait for the target fetch from
LDST. No special IB reallocation is required, and the target
prefetch block is loaded into the CURR IB buffer, followed
by a PFU prefetch of the next prefetch block into the NEXT
IB buffer.

FIGS. 12a and 12b illustrate the exemplary virtual buffer
management scheme for COFs that miss in the BTC. When
the COF decodes, the physical IB buffer logically allocated
as PREV is assigned the N_PRED tag.

Referring to FIG. 12a, if the static prediction for COF
direction is taken (LOOPs), then when the COF decodes the
PFU discontinues transferring instruction bytes from the
CURR IB buffer — the IB buffers are reallocated

    CURR→PREV
    NEXT→CURR
    PREV→NEXT

CURR and NEXT are cleared — the LDST unit will issue a
prefetch request for the target prefetch block during AC,
which will be loaded into CURR, and the Prefetch unit will
issue a prefetch request for the next sequential prefetch
block [illegible].

After IB buffer reallocation, PREV holds the not-predicted
(not-taken) path and is assigned N_PRED. The
COF moves to the EX stage, followed in the execution
pipeline by instructions in the predicted-taken path — if the
COF resolves mispredicted, N_PRED is reallocated as
CURR and instructions in the not-predicted (not-taken)
path are transferred to the Decoder.

Referring to FIG. 12b, if the static prediction for COF
direction is not taken (branches), then sequential prefetching
continues. PREV is tagged N_PRED, and the LDST
prefetches the target prefetch block, which will
be loaded into the N_PRED IB buffer.
If the COF resolves mispredicted, the N_PRED IB
buffer is allocated as CURR, and the Prefetch unit switches
to the not-predicted target instruction.

In either case, if the static prediction for the conditional
COF is correct, the buffered instructions from the not-predicted
direction are cleared, and the IB buffer is available for
normal sequential prefetching.

Referring to FIG. 3, far COFs present an added complication
because, in many cases, the target prefetch request for
a far COF is issued a number of cycles before it completes.
As a result, updating machine state for the new segment is
delayed, stalling instruction decode even though target
instructions may already have been fetched. Specifically,
many far COFs are decoded as mode-change instructions —
ID 211 signals the PFU 240 when a mode-change COF is
decoded, and then stalls until EX signals that COF execution
is complete even though the PFU is enabled to receive target
instructions and even to prefetch beyond the target and
completely fill the IB.

When an exception is signaled, the PFU clears the entire
prefetch buffer 242 (IB1-IB3) and stalls prefetching —
LDST 260 issues the initial target prefetch request to retrieve
the appropriate exception handling routine. Reset is treated
similarly — upon reset, the PFU emerges waiting on a fetch
from the LDST for the first block of instructions, and then
begins prefetching from that point.

4.2.2. BTC Hit

For BTC hits that result in a taken prediction, the PFU/BU
will issue a target prefetch request as the instruction previous
to the COF is decoding. As a result, if the target prefetch
request hits in the L1 Cache, the returned target prefetch
block generally will have to be buffered for at least one cycle
to allow the COF instruction to complete decoding.

When the BTC hit is signaled, the PREV IB buffer is
cleared, and pf_adctl assigns a TARG tag to identify the IB
buffer that will receive the target prefetch block. At the same
time, the bottom 4 bits of the target address are saved in the
pointer control logic (314 in FIG. 3) to serve as the byte
pointer for the target instruction within the target prefetch
block. The PFU continues sequential prefetching into the
CURR and NEXT IB buffers to ensure that all bytes necessary
to decode the ensuing COF instruction are fetched.

FIG. 12c illustrates the exemplary virtual buffer manage- 
ment scheme for COFs that hit in the BTC. When the BTC 
signals hit based on the access with the IP for the instruction 
previous to the COF, the physical IB buffer logically allo- 
cated as PREV will be assigned the TARG tag— the Prefetch 
unit will issue a prefetch request for the target prefetch 
block. When the COF decodes, the TARG IB buffer (which 
may now contain the target prefetch block) will be allocated 
as CURR, and instruction byte transfer to the Decoder will 
commence with the target instruction. 

In issuing the target prefetch request, the PFU/BU only
supplies the L1 Cache index (plus the way number) for the
target address. As a result, this target prefetch request is
disallowed from going off-chip (the PFU aborts the request
as it is issued). When the target prefetch request issues, the
HFILL tag is set equal to the TARG tag to indicate a return
from the L1 Cache is expected.

If the target prefetch request hits in the L1 Cache, the
returned prefetch block is loaded into the IB buffer designated
by the TARG tag. When the Decoder signals that the
COF has finished decode, the CURR IB tag is set to the TARG
tag, and NEXT and PREV IB tags are reallocated accordingly.
Using the TARG byte pointer, the PFU initiates an
8-byte transfer to the Decoder commencing with the initial
target instruction byte. The NEXT and PREV IB buffers are
cleared, and the PFU commences sequential prefetching.

If the target prefetch request misses in the L1 Cache, the
PFU will stall after the COF decodes. When the COF
reaches EX and resolves, the BU provides the correct target
address (assuming the COF is a UCOF or resolves taken) for
a prefetch request.

For the exemplary implementation, if a BTC hit is signaled
but the PFU cannot issue a target prefetch request over
the C-BUS before the instruction prior to the COF finishes
decode (such as due to heavy C-BUS traffic), the PFU forces
ID 211 to stall by disabling further instruction byte transfers.
Stalling the ID ensures that the COF instruction does not start
decoding before its target prefetch request issues,
thereby preventing another BTC hit for a subsequent COF as
the initial COF is decoding.

Because the exemplary BTC organization allows aliasing
(see, Section 3.1), a BTC hit may result from the IP for an
instruction that aliases with the IP for an instruction that
is prior to a COF. A BTC non-COF alias will be detected
when no COF is decoded, and the PFU will clear the TARG
IB buffer — sequential execution will continue without
interruption using instructions already prefetched into the other
blocks.
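
The aliasing condition can be made concrete with a short sketch. It assumes the BTC organization given in Section 3.1 (IP[4:0] as the index and IP[11:5] as the 7 bit tag); the helper names are hypothetical and the example IPs are arbitrary.

```python
def btc_index(ip: int) -> int:
    """IP[4:0] selects 1 of 32 entries in each way."""
    return ip & 0x1F


def btc_tag(ip: int) -> int:
    """IP[11:5] forms the 7 bit tag compared against each selected entry."""
    return (ip >> 5) & 0x7F


def aliases(ip_a: int, ip_b: int) -> bool:
    """Two distinct IPs alias when index and tag both match, i.e. when they
    agree in bits [11:0] and differ only in the upper 20 bits."""
    return (ip_a != ip_b
            and btc_index(ip_a) == btc_index(ip_b)
            and btc_tag(ip_a) == btc_tag(ip_b))
```

For example, `0x00401234` and `0x00802234` share the low 12 bits `0x234` and therefore alias, whereas `0x00401234` and `0x00401254` share an index but differ in the tag.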

4.2.3. RSTK 

Referring to FIGS. 3, 6a, and 8, the RSTK is accessed
when a RETurn is decoded — a RSTK hit results in the
RETurn target address being popped off the RSTK and
supplied to the PFU. The RSTK supplies the full [31:0]
physical target address, which the PFU uses to generate a
RETurn target prefetch request. The operation is similar to
a COF that misses in the BTC, except that the PFU is able
to generate the prefetch request rather than waiting for the
LDST to initiate the prefetch request in the AC stage.

Prior to issuing the RETurn target prefetch request, the 
PFU clears all IB buffers — any prefetch blocks returned in 
response to prior prefetch requests will be ignored. The 
returned target prefetch block is loaded into the CURR IB 
buffer (using the HFILL and MFILL tags), and the PFU will 
continue sequentially prefetching down the new path. 
4.3. Short COFs 

FIG. 5d illustrates short COF conditions in which the
target of a COF has already been prefetched into one of the
IB buffers IB1-IB3. The virtual buffer management scheme
implements a mechanism for detecting short COF
conditions, allowing the PFU to supply the target instruction
to the Decoder on the next cycle, thereby avoiding a target
prefetch request (with attendant power saving and C-BUS
performance enhancement).

Referring to FIGS. 4b and 6c, each entry in the BTC
includes a three-bit IB_LOC field that, when valid, stores
the logical tag — CURR, NEXT, PREV — for the IB buffer in
which the target instruction is located. As described in
Section 3.1.2, the second time a short COF is encountered,
the PFU compares the target physical address (i.e., the L1
Cache index [11:0] from the BTC and the L1 Cache tag
[31:12] from the L1 Cache) to the prefetch address tags in
each of the IB buffers — if a match occurs, indicating that the
cache line containing the target is already in one of the IB
buffers, the IB_LOC field for that entry is updated with the
associated IB tag.

IB_LOC is a logical pointer because the physical IB 
buffers to which the logical tags CURR/NEXT/PREV are 
assigned may change. 
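
To illustrate why a logical pointer is needed, the sketch below models the logical-to-physical tag mapping as a dictionary and applies the sequential remapping CURR→PREV, NEXT→CURR, PREV→NEXT. The function name and the dictionary representation are hypothetical; in the exemplary design each IB tag is a bit vector maintained by pf_adctl.

```python
from typing import Dict


def rotate(tags: Dict[str, str]) -> Dict[str, str]:
    """Sequential reallocation: the physical buffer that was CURR becomes
    PREV, NEXT becomes CURR, and the old PREV is reused as the new NEXT."""
    return {
        "PREV": tags["CURR"],
        "CURR": tags["NEXT"],
        "NEXT": tags["PREV"],
    }


# Hypothetical starting allocation of logical tags to physical buffers.
initial = {"CURR": "IB1", "NEXT": "IB2", "PREV": "IB3"}
after_one = rotate(initial)
```

After one rotation the physical buffer IB2 is CURR and IB1 is PREV, so a pointer recorded as "CURR" continues to name the buffer currently feeding the Decoder even though the physical buffer behind it has changed. Three rotations restore the starting allocation.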

Referring to FIGS. 3 and 4a, a short COF BTC hit
sequences similarly to a normal BTC hit. When the hit is
signaled, the BU provides the target IB tag from the
IB_LOC field, identifying the location of the target as
within the CURR, NEXT, or PREV IB buffer. The CURR/
NEXT/PREV IB buffer indicated by the target IB tag from
IB_LOC is checked for validity, and if valid, the PFU
inhibits a target prefetch request (by not gating the L1 Cache
index output from the BTC onto the C-BUS) — until the COF
decodes, the TARG IB tag is assigned to the IB buffer
designated by IB_LOC (which may be the CURR, NEXT,
or PREV IB buffer), allowing IB buffer reallocation to
continue as the COF instruction bytes are transferred to the
Decoder. If the contents of the designated IB buffer are not
valid, the PFU treats the COF as a normal BTC hit, and
issues a prefetch request for the target prefetch block.

Once the COF decodes, the PFU commences transferring
target instruction bytes to the ID; the location of the initial
byte of the target instruction within a designated IB buffer is
given by the byte location number [3:0] stored in the BTC
along with the L1 Cache tag [11:4]. The IB buffer tagged
TARG is allocated as the CURR IB buffer, and, if necessary,
the other IB buffers are reallocated accordingly. However,
the NEXT and PREV IB buffers are not cleared at this point
(as they would be for a normal COF), because the sequence
of prefetch blocks within the IB buffers has not been
disturbed (even though the CURR/PREV/NEXT allocations
may have changed). As a result, the IB buffer holding
instructions following the new logical CURR IB buffer may still
be used as the NEXT IB buffer if they are valid. The new
PREV IB buffer instructions may be treated similarly. The
prefetch address stored in pf_apath 304 is incremented by
more than one prefetch block if both the CURR and NEXT
IB buffers retain valid instructions after the COF is decoded.

Even though a detected short COF obviates a target
prefetch request, to prepare for prefetching beyond the
target, the prefetch registers in pf_apath 304 must be
updated. Consequently, in response to the BTC hit, a target
prefetch is cycled through the PFU, except that the L1 Cache
index is not gated onto the C-BUS as a prefetch request. The
prefetch address tag in the designated IB buffer provides the
high order bits [31:4] of the target address, while the initial
instruction byte pointer for the target instruction within the
IB buffer is provided by the byte location number [3:0]
driven onto the pf_ifa bus from the BTC in response to the
hit.

If a BTC hit results in a short COF match, and the IB
buffer containing the short COF target is subsequently
cleared before the short COF decodes, instruction prefetch is
halted upon decode of the short COF, and a misprediction is
signaled when the short COF completes EX. Also, if the
NEXT-IP aliases to a short COF such that a BTC hit for the
short COF is signaled but no COF decodes, the IB buffer
tagged TARG is not cleared because by definition the
contents of that IB buffer derive from normal sequential
prefetching; if the TARG assignment incorrectly resulted
from aliasing, the TARG tag should be removed and
sequential instruction decode continued.

For the exemplary embodiment, a short COF compare on
an IB buffer is not valid if the IB buffer is already tagged
TARG or N_PRED for a previous COF, because such a
designation may not apply if the short COF is again
encountered. For example, if a normal COF that misses in the BTC
and results in an IB buffer being tagged N_PRED is then
followed closely by a short COF that hits in the BTC, it is
undesirable to recognize a short COF match: if the first
COF resolves taken, and the same code executes again, the
first COF will then hit in the BTC and no N_PRED
assignment will be made, such that the following short COF
will now match on an IB buffer pointed to by IB_LOC that
may not contain the same instructions (due to different IB
buffer allocations depending on whether the N_PRED tag is
active). As a result, a short COF match would cause a target
address mismatch (misprediction).

4.4. Not-Predicted Path

Referring to FIGS. 3 and 4a, for conditional COFs that
miss in the BTC, the PFU buffers the non-predicted path
(either the not-predicted taken path for branches or the
not-predicted not-taken path for LOOPs).

When a COF is decoded with no BTC hit, pf_adctl 306
reallocates the IB tags for the IB buffers to buffer the
predicted and not-predicted paths; in particular, the PREV
IB buffer is used to buffer instructions in the not-predicted
path (see, Section 4.2.1). When the COF decodes, the
N_PRED tag is assigned to the physical IB buffer that has
been allocated as PREV (for example, IB1 in FIG. 5a) until
the COF resolves in EX (or another COF decodes), even if
that physical IB buffer is reallocated.

Similarly, in the pointer logic 314 of pf_dpath 302, an
N_PRED pointer is maintained which represents the byte
pointer to the first instruction of the not-predicted path
within the N_PRED buffer. For branches (statically
predicted not-taken), this pointer is acquired from the lower 4
bits of the target address as the target prefetch request is
issued to the C-BUS by LDST. For LOOPs (statically
predicted taken), this pointer is saved when the COF finishes
decode; at that point ID 211 indicates the byte position of
the instruction following the COF over the id_shift lines.

For cases in which both the predicted and not-predicted
paths must be buffered [illegible] the MFILL and HFILL
[illegible] which may return during the same cycle. For
instance, a prefetch request for the (not-taken) path
past the COF which missed in the L1 Cache may return data
at the same time that the target prefetch request for the COF
from the LDST hits in the L1 Cache. The MFILL and HFILL
tags are used in conjunction with the CURR, NEXT, and
N_PRED tags to properly route the data in the IB buffers
IB1-IB3.

Specifically, if a second COF misses in the BTC before
the first completes its EX stage, the N_PRED IB buffer for
the first COF is cleared, and the N_PRED tag and pointer
are then updated for the second COF. Similarly, the PFU only
buffers both paths of the most recent COF. Control logic in
pf_cbusctl 308 tracks the number of pending COFs,
identifying the target prefetch request from LDST for the most
recent COF (i.e., in a multiple-COF situation, all but the
latest target prefetch requests are ignored).

4.5. Mispredicted COF

If a COF is mispredicted (i.e., the statically or
dynamically predicted direction does not match the resolved
direction available in EX), the execution pipe is flushed, and the
PFU switches to the correct path of the COF. The predicted
path instructions in the prefetch buffer are flushed, and
prefetch resumes in the correct direction. In addition to
mispredicted conditional COFs, other conditions such as
BTC aliasing can force the BU to signal a mispredicted COF
(even if the COF is unconditional).

Referring to FIGS. 3 and 4a, in the case of a statically
mispredicted COF that missed in the BTC, the prefetch
buffer 242 may already contain the first prefetch block in the
not-predicted path, stored in the IB buffer designated by the
N_PRED buffer tag. If this IB buffer is still valid at the time
of misprediction, the N_PRED buffer is reallocated as the
new CURR IB buffer, with the N_PRED pointer used to
identify the initial instruction byte of the not-predicted target
or fall through instruction. The other IB buffers are
reassigned NEXT and PREV, and cleared, and the PFU
commences prefetching into the NEXT IB buffer.

If a misprediction results from a COF that hits in the BTC
or RSTK, or for a COF that misses in the BTC and the
N_PRED tagged IB buffer is not valid for some reason, then
the correct not-predicted path instructions are not available
in the IB buffers at the time the misprediction is signaled.
When the BU signals the misprediction, the entire prefetch
buffer is cleared, but the IB buffers are not reallocated, and
sequential prefetch starts with the not-predicted address
supplied by the BRB.
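
The two misprediction recovery cases just described can be sketched as a small behavioral model in Python. This is only an illustration under stated assumptions: PrefetchState and on_mispredict are hypothetical names, the choice of which remaining physical buffer becomes NEXT and which becomes PREV is arbitrary here, and taking the starting byte pointer from the low 4 bits of the BRB address in the second case is an assumption rather than something the description states.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class PrefetchState:
    tags: Dict[str, str]       # physical buffer name -> logical tag
    valid: Dict[str, bool]     # physical buffer name -> contents valid
    n_pred_buf: Optional[str]  # physical buffer tagged N_PRED, if any
    n_pred_ptr: int            # byte pointer [3:0] into the N_PRED buffer


def on_mispredict(state: PrefetchState, hit_in_bu: bool,
                  brb_addr: int) -> Tuple[int, Optional[int]]:
    """Return (starting byte pointer, prefetch address to issue or None)."""
    buf = state.n_pred_buf
    if not hit_in_bu and buf is not None and state.valid[buf]:
        # Not-predicted path already buffered: N_PRED becomes the new CURR.
        others = [name for name in state.tags if name != buf]
        state.tags[buf] = "CURR"
        for name, tag in zip(others, ("NEXT", "PREV")):
            state.tags[name] = tag
            state.valid[name] = False  # reassigned buffers are cleared
        state.n_pred_buf = None
        return state.n_pred_ptr, None
    # Otherwise: clear the entire prefetch buffer, keep the allocations, and
    # restart sequential prefetch at the not-predicted address from the BRB.
    for name in state.valid:
        state.valid[name] = False
    state.n_pred_buf = None
    return brb_addr & 0xF, brb_addr
```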
5. Limit Checking

The Prefetch unit includes mechanisms to ensure that
segment and page boundaries are not crossed when the
Branch unit supplies the prefetch target address. In
particular, from the Background, to maintain compatibility
with the 486 specification, instruction fetch (and decode)
beyond a segment limit should result in a segment limit
violation (exception).

FIG. 13 illustrates segment limit checking for normal
sequential prefetching, as well as the exemplary scheme for
detecting segment limit violations in the case of COFs that
hit in the branch unit (BTC or RSTK).
5.1. Segment Limit Checking 

Referring to FIG. 3, during sequential prefetching, PFU 
240 conventionally detects whether a prefetch address is for 
a prefetch block that will contain a segment limit — if so, 
then if sequential decoding continues to the point of the 
boundary, the PFU signals a segment limit exception. If a 
page boundary is encountered, the PFU must initiate an 
address translation via TLB 230 to get the physical address 
of the new page from which it can continue prefetching. 

The logic for maintaining the physical and linear prefetch
addresses, and the segment limit address, and for segment
and page limit checking, resides in pf_apath 304. The PFU
maintains both the linear and physical prefetch addresses LA
and PA (see, Section 2.3); the lower 12 bits of the LA and
PA are identical.

Referring to FIG. 12, the pf_apath logic includes three 
registers used in segment limit checking: (a) CSLA 361 
holds the linear address of the segment limit, (b) PRLA 362
holds high order 20 bits [31:12] of the linear prefetch 
address, and (c) PRPA 363 holds the physical prefetch 
address [31:0], including the low order 12 bits [11:0] that are 
the same for the linear address. CSLA is loaded from the 
AC_LABUS (FIG. 3) any time a new code segment is 
entered. 

The high order 20 bits of PRLA are obtained as follows: 
(a) during sequential prefetching, if the PRPA address is 
incremented and a carry-out occurs from the 12th bit, the 
PRLA is incremented, (b) if a COF is encountered and 
predicted taken, the PRLA is loaded with the target linear 
address calculated when the COF reaches AC and supplied 
over the AC_LABUS, and (c) if a conditional COF is 
resolved mispredicted, AC 212 provides the correct linear 
address. 

During sequential prefetching, prior to issuing a prefetch 
request, the linear code segment limit address in CSLA is 
compared (365) with the linear prefetch address formed by 
the high order PFLA bits [31:12] and the low order PFPA 
bits [11:3] (i.e., 16-byte prefetch block granularity). If a 
match occurs (366), the segment limit is known to be 
somewhere in the prefetch block — when the prefetch block 
is loaded into an IB buffer, the lower bits [3:0] of the 
segment limit address are used to mark the valid bytes that 
precede the boundary by setting (367) the appropriate seg- 
ment valid bits SV and segment limit bits SL (see, Section 
4.1). 
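
As a hedged illustration of the sequential segment limit comparison, the following Python sketch forms the linear prefetch address from linear bits [31:12] and physical bits [11:0], compares it against the segment limit address, and derives the byte-valid marking. It assumes the comparison is made at the 16-byte prefetch block granularity mentioned above, that SV is one bit per byte with the limit byte itself counted as valid, and that SL is a single per-block flag; the function names are hypothetical.

```python
from typing import Tuple


def prefetch_linear_address(la_hi: int, pa_lo: int) -> int:
    """Linear prefetch address from linear bits [31:12] and physical bits [11:0]."""
    return ((la_hi & 0xFFFFF) << 12) | (pa_lo & 0xFFF)


def segment_marks(csla: int, la_hi: int, pa_lo: int) -> Tuple[int, bool]:
    """Return (SV bit mask, SL flag) for one 16-byte prefetch block.

    Bit i of the SV mask set means byte i of the block may be decoded.
    """
    block_la = prefetch_linear_address(la_hi, pa_lo)
    if (csla >> 4) != (block_la >> 4):
        return 0xFFFF, False           # limit not in this block: all bytes valid
    limit_byte = csla & 0xF            # lower bits [3:0] of the segment limit
    sv = (1 << (limit_byte + 1)) - 1   # assumption: bytes 0..limit_byte are valid
    return sv, True
```

For example, with a segment limit address of 0x00012347 and a prefetch block at linear address 0x00012340, bytes 0 through 7 are marked valid and the SL flag is set.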

If a prefetch block contains a segment limit, the PFU stops 
prefetching. If a taken COF is encountered, or if an excep- 
tion is encountered, prefetching resumes at the target or 
exception handler. Otherwise, normal sequential decode will 
proceed, with the PFU sequencing through the IB buffers as 
it delivers instruction bytes to ID 211 until the PFU detects 
that (a) the CURR IB buffer contains a segment limit (SL 
set), and (b) all valid bytes in the IB buffer have been 
transferred to the decoder — at this time, a segment limit 
exception will be signaled. 

The exemplary PFU does not include independent
linear-to-physical address translation capability (relying instead on
AC/TLB). However, in the case of a BTC/RSTK hit, the
PFU will not have the corresponding target linear address
when the target prefetch block returns (available only when
the COF reaches AC). Thus, the PFLA register will be
invalid for the returned prefetch block. If the target address
has jumped into a cache line containing the segment limit in
the CSLA register, the PFU will not be able to perform a
comparison to detect a segment limit violation.

Referring to FIG. 3, COFs that miss in the BU do not
present a problem. When the COF reaches AC 212, the
target linear address is calculated, and if paging is enabled,
the TLB 230 performs linear-to-physical translation. The
target LA and PA are supplied to the LDST 260 for
prefetching the target address as either the predicted or not-predicted
path of the COF, and to the PFU to update its PFLA and PFPA
if the COF is predicted taken; if the COF resolves
mispredicted, the PFU will correspondingly update the
PFLA and PFPA. The PFU performs a segment limit check
when the PFLA is supplied during AC1, and the results of
the comparison are saved until the target prefetch block
returns.

Far COFs (which will always miss in the BU) are not a
problem, either, because the CSLA register will be updated
before the target prefetch request is issued from LDST 260.

Referring to FIG. 12, in the case of COFs that hit in the
BTC/RSTK 252/342, the exemplary PFU issues the target
prefetch request, and performs a CSLA segment limit
comparison (371) using (a) bits [11:3] of the target address
supplied by the BTC or RSTK, and (b) bits [11:3] of the
existing code segment limit address in CSLA 361. If no
match occurs, prefetching continues; any subsequent
sequential prefetches prior to target linear address
availability from the AC stage 212 will also only use bits [11:3] of
the physical prefetch address for segment limit comparison.

If a CSLA match occurs (372) with bits [11:3] of a
BTC/RSTK target prefetch address, or any of the ensuing
sequential prefetches, prefetching stops, and a potential
segment limit violation condition is detected (373). The
instruction bytes of the returned prefetch block are marked as if a
segment limit violation was detected, i.e., the SL and SV bits
for the prefetch block are appropriately set (374). The
Prefetch unit will continue transferring instruction bytes to
the Decoder up to the potential segment limit, and then stall.
When the COF reaches AC 212, and the target linear
address is calculated, the upper linear address bits [31:12] of
the last prefetch request become available for CSLA
comparison (375) to determine if an actual segment limit
violation occurred. If the CSLA upper bit comparison matches
(376), the PFU will continue delivering instruction bytes to
the Decoder up to the segment limit and then signal a
segment limit exception (or if Prefetch transfer has already
reached the segment limit, immediately signal a segment
limit violation). If the CSLA upper bit comparison
mismatches, state information indicating a potential segment
limit violation is cleared (377), and sequential prefetch
continues (i.e., the SL bits in all IB buffers are reset, and the
SV bits are all set).
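
The two-phase check for COFs that hit in the BTC/RSTK can be summarized in a short Python sketch: a partial compare on bits [11:3] when the target prefetch issues, followed by a confirming compare on linear bits [31:12] once the AC stage supplies the target linear address. The function names and the returned strings are hypothetical labels chosen for illustration.

```python
PARTIAL_MASK = 0xFF8  # address bits [11:3]


def partial_limit_match(target_addr: int, csla: int) -> bool:
    """Phase 1: compare bits [11:3], available immediately on a BTC/RSTK hit."""
    return (target_addr & PARTIAL_MASK) == (csla & PARTIAL_MASK)


def confirm_limit_match(target_la: int, csla: int) -> bool:
    """Phase 2: compare linear bits [31:12] once AC supplies the linear address."""
    return (target_la >> 12) == (csla >> 12)


def resolve(target_pa: int, target_la: int, csla: int) -> str:
    """Outcome of the two-phase check for one BTC/RSTK target prefetch."""
    if not partial_limit_match(target_pa, csla):
        return "continue"            # no potential violation, keep prefetching
    # Potential violation: the block is marked and decode may run up to the limit.
    if confirm_limit_match(target_la, csla):
        return "segment_limit"       # exception once decode reaches the limit
    return "clear_and_continue"      # reset SL, set all SV, resume prefetching
```

The partial compare can produce false positives (the same offset in a different page), which is why the potential violation is only recorded, and either confirmed or cleared, once the upper bits are known.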

Referring to FIG. 3, when a conditional COF is
mispredicted, a similar problem exists. The exemplary BU
250 stores physical addresses in its BRB (branch resolution
buffer); for COFs, the exemplary AC unit 212 maintains a
copy of the linear address for both the taken and not taken
paths of a conditional COF.

A misprediction is signaled when the COF resolves in EX,
and in that same cycle the PFU either (a) issues a prefetch
request using the not-predicted COF address from the BRB,
or, (b) if the not-predicted path is already buffered in an IB
buffer, the PFU switches to the not-predicted path in
transferring instruction bytes to the Decoder. However, the
exemplary AC unit does not provide the linear address of the
proper instruction path until the next cycle, which is too late
to check whether the segment boundary lies in the first
8-byte transfer to the Decoder.

No problem arises if the not-predicted path is already
buffered at the time of misprediction, because segment limit
checking would already have been done when that prefetch
block was fetched. Otherwise, a CSLA comparison must be
performed on bits [11:3] of the physical misprediction
prefetch address from the BRB; a partial match condition is
noted, and then tested when the full linear address is
available from AC in the next cycle.

One other segment limit case requires special handling in
the exemplary PFU. If a predicted-taken COF resides at the
segment limit, and the COF resolves not-taken, the segment
limit is violated. If the COF also resides at the end of a
cache line, then the address supplied by BRB for the
misprediction prefetch request will be to the following cache line.
Consequently, if the PFU uses bits [11:3] of the physical
address from BRB for CSLA comparison, then a match
condition will be missed.

The exemplary PFU and BU handle this situation by 
detecting when a COF resides at the segment limit during 
normal segment limit checking. The BU is signaled that such 
is the case, and this information is stored in the BRB with 
other information about the COF. If the COF was predicted- 
taken, and resolves not-taken, the BU signals the PFU of the 
special case, which will then signal a segment limit viola- 
tion. 

5.2. Page Limit Checking 

When prefetching encounters a page limit, the PFU ini- 
tiates a TLB lookup to generate a new physical prefetch 
address. That is, when a page limit is encountered, even if 
the segment limit has not been reached, it is not known 
whether the next linear page is sequentially placed in 
physical memory. In a manner similar to the detection of 
segment limits, the PFU must determine whether a prefetch 
request contains a page limit — such detection is made easier 
by the fact that page boundaries are aligned to 16-byte cache 
line (prefetch block) boundaries. 

Referring to FIGS. 3 and 4a, for normal sequential 
prefetching, when a prefetch request is issued, the physical 
address is incremented and used to update the PRPA register 
in preparation for the next sequential prefetch (see, Section 
2.3). Because page size is fixed at 4K bytes, if incrementing 
the physical address of an outgoing prefetch request causes 
a carry-out from the 12th bit position, a page boundary exists
at the end of the prefetch block being fetched — when the 
prefetch block returns and is loaded into an IB buffer, the PL 
bit is set (see, FIG. 4b and Section 2.1). 
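
The carry-out test for page limit detection can be shown with a minimal Python sketch, assuming 16-byte prefetch blocks and 4K-byte pages as stated above. The function name advance_prefetch is hypothetical; the model tracks only the 12-bit page offset and the 20-bit linear page number, since the physical page bits must come from a TLB lookup once a page boundary is reached.

```python
from typing import Tuple

BLOCK_BYTES = 16  # prefetch block size


def advance_prefetch(pa_low12: int, la_page: int) -> Tuple[int, int, bool]:
    """Increment the prefetch address by one block.

    Returns (next page offset, next linear page number, page_limit), where
    page_limit is True when the increment carried out of the 12-bit offset,
    i.e. the block just requested is the last one in its page (PL bit set).
    """
    total = (pa_low12 & 0xFFF) + BLOCK_BYTES
    page_limit = total > 0xFFF
    next_low12 = total & 0xFFF
    next_la_page = (la_page + 1) & 0xFFFFF if page_limit else la_page
    return next_low12, next_la_page, page_limit
```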

When a page limit condition is detected, the PFU will then
initiate a TLB access with the linear address for the next
prefetch from PRLA; this PRLA address will point to the
new page since its value was updated as the prefetch request
issued. In the exemplary processor, the PFU must compete
with the AC unit 212 for access to the TLB, and the PFU is
given lower priority; thus, it may be several cycles before
the PFU access is granted. If the PFU access hits in the TLB,
the LDST unit 260 issues the prefetch request using the new
translated physical address (which is also supplied to the
PFU for updating the PRPA register).

If the PFU access misses in the TLB, the instruction fetch 
is aborted by LDST, and the PFU idles pending reaching the 
page boundary. The LDST prefetch request is aborted for 
performance reasons — a tablewalk is avoided until it is 
known that the new page will be required. The PFU will 
continue to transfer instruction bytes to the ID 211 up to the 
page limit — if a taken COF or an exception is encountered, 
prefetching resumes at the COF target or exception handler 
(thereby obviating the tablewalk for the new page). 

If sequential prefetching proceeds into the IB buffer with
its PL bit set, and instruction decode reaches the last byte of
this buffer without a COF or exception, it is then known that
the page boundary will be crossed. The PFU will again
access the TLB for a translated physical address, and this time
the tablewalk will occur. Sequential prefetching will
continue with the prefetch request from LDST.

In the case of a predicted-taken COF, as in the case of
segment limit checking, the target linear address is not
immediately available after a BTC or RSTK hit. This
address is required for the PFU to access the TLB to acquire
the physical address for the new page. Consequently, after a
BTC/RSTK hit, if a page limit is detected, the PFU halts
prefetching until the COF reaches the AC stage and the
target address is calculated.

The exemplary PFU 240 includes logic in pf_apath 304
to handle three special cases: (a) similar to segment limit
checking, the case where a predicted-taken COF resides at
the page boundary, and resolves to be not-taken, (b) the case
where prefetching along the not-predicted path of a COF
causes the PFU to access the TLB and that access misses,
which would then cause a tablewalk that might not be
necessary, and (c) a special case for 486 compatibility.

Regarding predicted-taken COFs at a page boundary, the
not-predicted address provided by the BRB will already
have been incremented to the new prefetch block (and thus
to the next sequential page frame), so the PFU would not
detect a page limit violation. To handle this case, the PFU
signals the BU when the COF is transferred to ID 211 that
the page boundary has been reached; this information is
saved in the BRB and, if the COF is predicted-taken and
resolved not-taken, the PFU is signaled. The PFU will then
abort the misprediction prefetch, and use the linear address
for the not-predicted path provided by AC to access the TLB.

Regarding the case where prefetching along the
not-predicted path of a COF causes the PFU to access the TLB
and that access misses, for performance reasons, the
exemplary PFU suppresses the tablewalk. The COF continues into
EX as if its target address had hit in the TLB. If the COF is
mispredicted, the BU signals the PFU, which uses the linear
address from AC to access the TLB for the proper target
physical address; the PFU will abort its own misprediction
fetch, and allow the prefetch request to issue from LDST
after TLB translation.

Regarding the 486 compatibility issue, when the PG or PE
bit is altered in CR0 it is possible that the translation of the
page from which execution is currently occurring can be
changed. According to the conventional 486 specification, if
this occurs, the change does not take effect until either: (a)
instruction decode reaches the end of the cache line
(prefetch block) containing the write to CR0 that changed
either the PG or PE bit, or (b) a taken COF is encountered.

For the exemplary processor, the TLB signals the PFU
whenever the PG or PE bit is altered; the PFU sets an
internal state which causes the PFU to initiate a TLB lookup
in prefetching the next prefetch block beyond the one which
produced the write to CR0. If a COF is encountered, proper
translation of target and not-taken addresses should fall out.

6. Detecting Self-Modifying Code

From the Background, the 486 specification on
self-modifying code provides that to guarantee that the
modification of an instruction takes place before the instruction is
dispatched for execution, a write that modifies an instruction
should be immediately followed by a JMP to that
instruction. Significant complexities are introduced in handling
self-modifying code because of existing code that does not
follow the 486 specification, and by the use of branch
processing and store buffering.
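
The three detection checks that the remainder of this section describes (store followed by a jump to the stored address, Prefetch unit snooping of stores, and LDST unit snooping of prefetches) can be modeled loosely in Python as three predicates. This is a sketch under assumptions: the function names are hypothetical, the store/jump check is modeled as exact address equality, and the two snoop checks are modeled at 16-byte prefetch block granularity with block tags represented as full byte addresses.

```python
from typing import Iterable, Optional

BLOCK_SHIFT = 4  # assumption: snoops compare at 16-byte prefetch block granularity


def same_block(a: int, b: int) -> bool:
    """True when two byte addresses fall in the same prefetch block."""
    return (a >> BLOCK_SHIFT) == (b >> BLOCK_SHIFT)


def store_then_jump(latched_store_addr: Optional[int], cof_target: int) -> bool:
    """(a) A latched store address equals the target of the following COF."""
    return latched_store_addr is not None and latched_store_addr == cof_target


def pfu_snoop(store_addr: int, ib_tags: Iterable[int],
              pending_prefetches: Iterable[int]) -> bool:
    """(b) A store hits a block already in an IB buffer or still being fetched."""
    candidates = list(ib_tags) + list(pending_prefetches)
    return any(same_block(store_addr, addr) for addr in candidates)


def ldst_snoop(prefetch_addr: int, queued_store_addrs: Iterable[int]) -> bool:
    """(c) A prefetch address hits a store still queued in a reservation station."""
    return any(same_block(prefetch_addr, addr) for addr in queued_store_addrs)
```

Checks (b) and (c) are complementary: (b) catches a store that arrives after the prefetch has been requested or has completed, while (c) catches a prefetch issued while the store is still queued.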

Referring to FIGS. 1a, 3, and 4a, the exemplary processor
200 detects self-modifying code in three ways: (a) for
compatibility with the 486 specification, the LDST unit
detects when a store is followed by a COF that hits in the
BTC/RSTK, where the store address and the target address
(supplied by the BTC/RSTK) are the same, (b) the PFU 240
snoops the C-BUS for stores from the LDST 260, and
compares the store address to (i) each of the IB buffer
prefetch address tags (see, FIG. 4a), and (ii) the addresses of
any of its pending prefetch requests, and (c) the LDST unit
snoops prefetch addresses issued by the PFU, and compares
any prefetch address issued by the PFU or the LDST to
addresses of active stores queued in the store reservation
stations 262.

FIGS. 14a-14c illustrate the exemplary scheme for
detecting self-modifying code using respectively (a) an
implementation of the 486 jump/store specification for JMPs
that hit in the BTC, (b) Prefetch unit snooping of store
addresses for comparison with pending prefetch requests
and prefetch blocks already in the prefetch buffer blocks,
and (c) LDST unit snooping of prefetch requests for
comparison with stores queued in the LDST store reservation
stations.

Referring to FIG. 14a, for compatibility with the 486
specification, the LDST unit compares the most recent store
address with the target address of a COF that hits in the
BTC/RSTK, such that the target instruction may have
followed the COF into the execution pipe. Specifically, the
LDST is notified when a store reaches AC (381); the store
address from the AC/TLB is latched (382) in LDST, which
then allocates a reservation station. The LDST maintains the
latched store address in a separate register; if the LDST is
signaled by AC that a COF that hit in the BTC has followed
the store into the execution pipe (384), the LDST compares
(386) the latched store address with the target address from
AC [illegible]
the BU (387) which forces a COF mispredict signal (388)
when the COF reaches EX (flushing an unmodified target
instruction that may be behind it).

Referring to FIG. 14b, in the case of PFU snooping when
a store to an instruction is detected (391), the PFU must
insure that the updated instruction bytes are delivered to the
ID stage. Consequently, the PFU detects (392) cases where
a store will affect either (a) instruction bytes currently stored
in its IB buffers (393) or (b) the instruction bytes it is in the
process of fetching (394); in those cases, the altered code
is flushed and prefetching begins.

In the case of stores that affect pending prefetch requests,
the PFU snoops the C-BUS and compares store addresses to
pending prefetch requests; a match indicates that the
prefetch block to be returned contains code that will be
modified by the store. The pending prefetch request is
marked, and, when it completes, the data is cleared (401).
When the LDST signals (402) that the store is complete, the
prefetch request is conditionally retried (403).

In the case of stores that affect code in its IB buffers, if the
PFU snoops a store which hits one of the IB buffers (i.e.,
matches to the prefetch address tag for the IB buffer), the
action taken depends on IB buffer allocation (411). If the IB
buffer is tagged as either PREV or the N_PRED, it is simply
cleared (412); neither of these virtual buffers is needed for
proper functioning, only for performance enhancement. If
the IB buffer is tagged TARG, it is also cleared (415), but the
BU is signaled (416) that the target instructions for the BTC
hit underway have been nullified; when the COF decodes,
the TARG IB buffer will be allocated as CURR although,
TARG having been cleared, no instructions will be valid. As
a result, instruction decode will stall until the COF
completes in EX, at which time the BU will signal a
misprediction (417), and the PFU will respond by re-fetching the
target, this time including the modified instructions.

Finally, if a store matches either the CURR or NEXT IB
buffer, both are cleared (421). When the LDST signals that
the store is complete (422), the PFU re-fetches (423) the
cleared prefetch block so that decode can proceed; the
address tag associated with the CURR IB buffer is used as
the prefetch address, and sequential execution commences
once this prefetch request issues and the pf_apath logic
updates its registers.

Referring to FIG. 14c, in the case where the LDST unit
detects prefetch requests (431) that match (432) pending
(queued) store addresses in LDST reservation stations, the
PFU is signaled (435). The PFU prevents the returned data
from being delivered to ID (436). When the LDST unit
signals that the queued stores have been retired (437), the
PFU retries the prefetch request.

If the prefetch request misses in the L1 Cache, the LDST
snoop signal arrives too late to abort the prefetch request
from going off-chip; the PFU waits until the external bus
cycle completes, and then clears (436) the returned prefetch
block.

7. Conclusion

Although the Detailed Description of the invention has
been directed to certain exemplary embodiments, various
modifications of these embodiments, as well as alternative
embodiments, will be suggested to those skilled in the art.

In particular, while the implementation of the invention
for detecting self-modifying code in the context of branch
processing and store buffering has been described with
respect to an exemplary processor architecture and computer
system configuration, the invention has general application
to the detection of self-modifying code in a pipelined
processor with branch processing, thereby enhancing
prefetching operations, and execution pipeline performance,
in other processor architectures, and for other system
applications.

For example, specific register structures, mappings, bit
assignments, and other implementation details are set forth
solely for purposes of providing a detailed description of the
invention.

Also, references to dividing data into bytes, words, double
words (dwords), quad words (qwords), etc., when used
in the claims, are not intended to be limiting as to the size, but
rather, are intended to serve as generic terms for blocks of
data.

Moreover, various modifications based on trade-offs
between hardware and software logic will be apparent to
those skilled in the art.

The invention encompasses any modifications or alternative
embodiments that fall within the scope of the Claims.

We claim:

1. A processor implementing a scheme for detecting
self-modifying code where the processor supports branch
processing, and where at least some self-modifying code is
characterized by a store instruction that modifies a target
instruction followed by a jump instruction to jump to such
modified target instruction, comprising:

(a) a prefetch unit that issues prefetch addresses for
prefetch blocks of instruction bytes, and loads prefetch
blocks into a prefetch buffer for transfer to a decoder;

(b) a branch target cache (BTC) that for each of selected
COF (change-of-flow) instructions provides predicted
target address information used to generate a prefetch
address for a prefetch block including a corresponding
predicted target address; and

(c) store control logic responsive to a store instruction
being decoded to latch the associated store address at
least until the next instruction has completed decoding;

(d) the store control logic includes detection logic that
detects whether the next instruction is a jump instruction
that hits in the BTC such that the BTC supplies
predicted target information for the jump; and

(e) the store control logic includes comparison logic that
compares the store address and the actual target address
obtained from decoding the jump instruction, and if
they match, signals a code modification condition.

2. The processor of claim 1, wherein the store control 
logic comprises a load/store unit. 

3. The processor of claim 1, further comprising: 

(a) prefetch snoop logic in the prefetch unit that detects 
store addresses; 

(b) for each store address detected by such prefetch snoop
logic, the prefetch unit determines whether a prefetch
block match exists between the store address and either
(i) an address included within a prefetch block for
which a pending prefetch address has been issued but
a corresponding prefetch block not yet stored in the
prefetch buffer, or (ii) an address included within a
prefetch block already stored in the prefetch buffer;

(c) for each store address for which a prefetch block
match is detected, the prefetch unit (i) inhibits instruction
bytes in the corresponding prefetch block from
being transferred to the decoder, and (ii) re-issues a
prefetch address for such prefetch block after the
associated store operation is complete.

4. The processor of claim 1, wherein the store control 
logic includes store reservation stations that queue pending 
store addresses for multiple store operations, further com- 
prising: 

(a) store snoop logic in the store control logic that detects 
prefetch addresses issued by the prefetch unit; 

(b) for each prefetch address detected by such store snoop 
logic, the store control logic compares such prefetch 
address with pending store addresses, and in the case of 
a match, signals a code modification condition; 

(c) in response to the code modification condition, the
prefetch unit (i) inhibits instruction bytes in the corresponding
prefetch block from being transferred to the
decoder, and (ii) re-issues a prefetch address for such
prefetch block after the pending store operations are
complete.

5. A processor implementing a scheme for detecting
self-modifying code where the processor supports branch
processing, and where at least some self-modifying code is
characterized by a store instruction that modifies a target
instruction followed by a jump instruction to jump to such
modified target instruction, comprising:

(a) prefetch means for issuing prefetch addresses for
prefetch blocks of instruction bytes, and for loading
prefetch blocks into a prefetch buffer for transfer to a
decoder;

(b) branch target means for providing, for each of selected
COF (change-of-flow) instructions, predicted target
address information used to generate a prefetch address
for a prefetch block including a corresponding predicted
target address; and

(c) store control means for latching, in response to a store
instruction being decoded, the associated store address
at least until the next instruction has completed decoding;

(d) the store control means detecting whether the next
instruction is a jump instruction that hits in the BTC
such that the BTC supplies predicted target information
for the jump; and

(e) the store control means comparing the store address
and the actual target address obtained from decoding
the jump instruction, and if they match, signaling a
code modification condition.

6. The processor of claim 5, wherein the store control
means comprises a load/store unit.

7. The processor of claim 5, further comprising:

(a) prefetch snoop means for detecting store addresses;

(b) for each store address detected by such prefetch snoop
means, the prefetch means determining whether a
prefetch block match exists between the store address
and either (i) an address included within a prefetch
block for which a pending prefetch address has been
issued but a corresponding prefetch block not yet stored
in the prefetch buffer, or (ii) an address included within
a prefetch block already stored in the prefetch buffer;

(c) for each store address for which a prefetch block
match is detected, the prefetch means (i) inhibiting
instruction bytes in the corresponding prefetch block
from being transferred to the decoder, and (ii)
re-issuing a prefetch address for such prefetch block
after the associated store operation is complete.

8. The processor of claim 5, wherein the store control
means includes store reservation stations that queue pending
store addresses for multiple store operations, further comprising:

(a) store snoop means for detecting prefetch addresses
issued by the prefetch means;

(b) for each prefetch address detected by such store snoop
means, the store control logic comparing such prefetch
address with pending store addresses, and in the case of
a match, signaling a code modification condition;

(c) in response to the code modification condition, the
prefetch means (i) inhibiting instruction bytes in the
corresponding prefetch block from being transferred to
the decoder, and (ii) re-issuing a prefetch address for
such prefetch block after the pending store operations
are complete.

9. A method for detecting self-modifying code implemented
in a processor that supports branch processing,
where at least some self-modifying code is characterized by
a store instruction that modifies a target instruction followed
by a jump instruction to jump to such modified target
instruction, comprising the steps:

(a) issuing prefetch addresses for prefetch blocks of
instruction bytes, and loading prefetch blocks into a
prefetch buffer for transfer to a decoder;

(b) outputting, for selected COF (change-of-flow)
instructions, predicted target address information that is
used to generate a prefetch address for a prefetch block
including a corresponding predicted target address;

(c) latching, in response to a store instruction being
decoded, the associated store address at least until the
next instruction has completed decoding;

(d) detecting whether the next instruction is a jump
instruction that results in the output of corresponding target
address information used to generate a prefetch address
for a corresponding target prefetch block; and

(e) comparing the store address and the actual target
address obtained from decoding the jump instruction,
and if they match, signaling a code modification condition.

10. The method of detecting self-modifying code of claim
9, further comprising:

(a) detecting store addresses; 

(b) for each store address detected, determining whether
a prefetch block match exists between the store address
and either (i) an address included within a prefetch
block for which a pending prefetch address has been
issued but a corresponding prefetch block not yet stored
in the prefetch buffer, or (ii) an address included within
a prefetch block already stored in the prefetch buffer;

(c) for each store address for which a prefetch block 
match is detected, (i) inhibiting instruction bytes in the 
corresponding prefetch block from being transferred to 
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the decoder, and (ii) re-issuing a prefetch address for 
such prefetch block after the associated store operation 
is complete. 

11. The method of detecting self-modifying code of claim 
further comprising the steps: 

(a) queuing pending store addresses for multiple store 
operations; 

(b) detecting prefetch addresses that match pending store 
addresses, and in the case of a match, signaling a code 
modification condition; 

(c) in response to the code modification condition, (i) 
inhibiting instruction bytes in the corresponding 
prefetch block from being transferred to the decoder, 
and (ii) re-issuing a prefetch address for such prefetch 
block after the pending store operations are complete. 
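
For illustration, steps (a) through (c) of claim 11 (queuing pending store addresses, comparing prefetch addresses against them, and deferring the re-issued prefetch until the pending stores are complete) can be sketched as below. This is a rough software model under stated assumptions: the block-granular comparison, the 16-byte `BLOCK_BYTES` value, and the names `StoreQueueSnoop`, `queue_store`, `snoop_prefetch`, and `retire_store` are hypothetical.

```python
from collections import deque

BLOCK_BYTES = 16  # assumed comparison granularity; not fixed by the claims


class StoreQueueSnoop:
    """Toy model of prefetch-address snooping against queued stores."""

    def __init__(self) -> None:
        self.pending_stores = deque()  # step (a): queued store addresses
        self.deferred = []             # prefetch addresses awaiting re-issue

    def queue_store(self, store_addr: int) -> None:
        self.pending_stores.append(store_addr)

    def snoop_prefetch(self, prefetch_addr: int) -> bool:
        """Step (b): return True when a code modification condition is signaled."""
        block = prefetch_addr // BLOCK_BYTES
        hit = any(s // BLOCK_BYTES == block for s in self.pending_stores)
        if hit:
            # Step (c)(i): the matching block is held back from the decoder.
            self.deferred.append(prefetch_addr)
        return hit

    def retire_store(self) -> list:
        """Complete the oldest store; return prefetch addresses to re-issue."""
        self.pending_stores.popleft()
        if self.pending_stores:
            return []                  # stores still pending, keep deferring
        # Step (c)(ii): all pending stores are complete, so re-issue.
        reissue, self.deferred = self.deferred, []
        return reissue
```

In this sketch the deferred prefetch addresses are released only when the store queue drains, which is one simple reading of "after the pending store operations are complete".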

***** 
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